Documentation
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Fetcher ¶
type Fetcher struct {
// contains filtered or unexported fields
}
Fetcher handles making HTTP requests with configured retry logic, using an underlying http.Client.
func NewFetcher ¶
NewFetcher creates a new Fetcher instance.
func (*Fetcher) FetchWithRetry ¶
FetchWithRetry performs an HTTP request associated with the provided context. It implements a retry mechanism with exponential backoff and jitter for transient network errors and specific HTTP status codes (5xx, 429).
type HTTPFetcher ¶ added in v1.3.0
type HTTPFetcher interface {
FetchWithRetry(req *http.Request, ctx context.Context) (*http.Response, error)
}
HTTPFetcher is the interface for performing HTTP requests with retry logic.
type HostSemaphorePool ¶
type HostSemaphorePool struct {
// contains filtered or unexported fields
}
HostSemaphorePool manages per-host semaphores for rate limiting concurrent requests to each host. A single pool should be shared across all components (crawler, image processor) so the per-host limit is enforced globally.
func NewHostSemaphorePool ¶
func NewHostSemaphorePool(maxPerHost int, log *logrus.Entry) *HostSemaphorePool
NewHostSemaphorePool creates a new pool with the given per-host concurrency limit.
func (*HostSemaphorePool) Acquire ¶ added in v1.3.0
func (p *HostSemaphorePool) Acquire(ctx context.Context, host string) error
Acquire gets or creates a host semaphore and acquires one permit. Blocks until the permit is available or ctx is cancelled.
func (*HostSemaphorePool) Len ¶ added in v1.3.0
func (p *HostSemaphorePool) Len() int
Len returns the current number of tracked hosts.
func (*HostSemaphorePool) Release ¶ added in v1.3.0
func (p *HostSemaphorePool) Release(host string)
Release releases one permit for the given host.
func (*HostSemaphorePool) RunEviction ¶ added in v1.3.0
func (p *HostSemaphorePool) RunEviction(ctx context.Context, interval time.Duration)
RunEviction periodically removes idle host entries. Should be run in a goroutine.
type RateLimiter ¶
type RateLimiter struct {
// contains filtered or unexported fields
}
RateLimiter manages request timing per host for politeness.
func NewRateLimiter ¶
func NewRateLimiter(defaultDelay time.Duration, log *logrus.Entry) *RateLimiter
NewRateLimiter creates a RateLimiter.
func (*RateLimiter) ApplyDelay ¶
ApplyDelay sleeps if the time since the last request to the host is less than minDelay. It includes jitter (±10%) to desynchronize requests.
func (*RateLimiter) UpdateLastRequestTime ¶
func (rl *RateLimiter) UpdateLastRequestTime(host string)
UpdateLastRequestTime records the current time as the last request attempt time for the host. Call this *after* an HTTP request attempt to the host.
type RobotsHandler ¶
type RobotsHandler struct {
// contains filtered or unexported fields
}
RobotsHandler manages fetching, parsing, caching, and checking robots.txt data.
func NewRobotsHandler ¶
func NewRobotsHandler(
	fetcher HTTPFetcher,
	rateLimiter *RateLimiter,
	globalSemaphore *semaphore.Weighted,
	sitemapNotifier SitemapDiscoverer,
	cfg *config.AppConfig,
	log *logrus.Entry,
) *RobotsHandler
NewRobotsHandler creates a RobotsHandler.
func (*RobotsHandler) GetRobotsData ¶
func (rh *RobotsHandler) GetRobotsData(targetURL *url.URL, signalChan chan<- bool, ctx context.Context) *robotstxt.RobotsData
GetRobotsData retrieves robots.txt data for the targetURL's host, using the cache or fetching it. It returns the parsed data, or nil on any error, 4xx response, or missing file. signalChan is only used to coordinate the initial crawler startup fetch.
type SitemapDiscoverer ¶
type SitemapDiscoverer interface {
FoundSitemap(sitemapURL string)
}
SitemapDiscoverer defines the callback interface for handling discovered sitemap URLs.