fetch

package
v1.2.0 Latest
Warning

This package is not in the latest version of its module.

Published: Feb 16, 2026 License: Apache-2.0 Imports: 16 Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func NewClient

func NewClient(cfg config.HTTPClientConfig, log *logrus.Logger) *http.Client

NewClient creates a new HTTP client based on the provided configuration.

Types

type Fetcher

type Fetcher struct {
	// contains filtered or unexported fields
}

Fetcher handles making HTTP requests with configured retry logic, using an underlying http.Client.

func NewFetcher

func NewFetcher(client *http.Client, cfg *config.AppConfig, log *logrus.Logger) *Fetcher

NewFetcher creates a new Fetcher instance.

func (*Fetcher) FetchWithRetry

func (f *Fetcher) FetchWithRetry(req *http.Request, ctx context.Context) (*http.Response, error)

FetchWithRetry performs an HTTP request associated with the provided context. It implements a retry mechanism with exponential backoff and jitter for transient network errors and for specific HTTP status codes (5xx, 429).

type HostSemaphorePool

type HostSemaphorePool struct {
	// contains filtered or unexported fields
}

HostSemaphorePool manages per-host semaphores for rate limiting concurrent requests to each host. A single pool should be shared across all components (crawler, image processor) so the per-host limit is enforced globally.

func NewHostSemaphorePool

func NewHostSemaphorePool(maxPerHost int, log *logrus.Entry) *HostSemaphorePool

NewHostSemaphorePool creates a new pool with the given per-host concurrency limit.

func (*HostSemaphorePool) Get

Get retrieves or creates a semaphore for the given host.

type RateLimiter

type RateLimiter struct {
	// contains filtered or unexported fields
}

RateLimiter manages request timing per host for politeness.

func NewRateLimiter

func NewRateLimiter(defaultDelay time.Duration, log *logrus.Logger) *RateLimiter

NewRateLimiter creates a new RateLimiter.

func (*RateLimiter) ApplyDelay

func (rl *RateLimiter) ApplyDelay(host string, minDelay time.Duration)

ApplyDelay sleeps if the time since the last request to the host is less than minDelay. It includes jitter (+/- 10%) to desynchronize requests.

func (*RateLimiter) UpdateLastRequestTime

func (rl *RateLimiter) UpdateLastRequestTime(host string)

UpdateLastRequestTime records the current time as the last request attempt time for the host. Call this *after* an HTTP request attempt to the host.

type RobotsHandler

type RobotsHandler struct {
	// contains filtered or unexported fields
}

RobotsHandler manages fetching, parsing, caching, and checking robots.txt data.

func NewRobotsHandler

func NewRobotsHandler(
	fetcher *Fetcher,
	rateLimiter *RateLimiter,
	globalSemaphore *semaphore.Weighted,
	sitemapNotifier SitemapDiscoverer,
	cfg *config.AppConfig,
	log *logrus.Entry,
) *RobotsHandler

NewRobotsHandler creates a RobotsHandler.

func (*RobotsHandler) GetRobotsData

func (rh *RobotsHandler) GetRobotsData(targetURL *url.URL, signalChan chan<- bool, ctx context.Context) *robotstxt.RobotsData

GetRobotsData retrieves robots.txt data for the targetURL's host, using the cache or fetching it. It returns parsed data, or nil on any error, 4xx response, or missing file. signalChan is used only to coordinate the initial crawler startup fetch.

func (*RobotsHandler) TestAgent

func (rh *RobotsHandler) TestAgent(targetURL *url.URL, userAgent string, ctx context.Context) bool

TestAgent checks whether the user agent is allowed access based on cached/fetched rules. It returns true if access is allowed (or if the robots.txt fetch/parse fails), false otherwise.

type SitemapDiscoverer

type SitemapDiscoverer interface {
	FoundSitemap(sitemapURL string)
}

SitemapDiscoverer defines the callback interface for handling discovered sitemap URLs.
