package fetch

v1.3.3

Note: this package is not in the latest version of its module.

Published: Feb 21, 2026 License: Apache-2.0 Imports: 16 Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func NewClient

func NewClient(cfg config.HTTPClientConfig, log *logrus.Entry) *http.Client

NewClient creates a new HTTP client based on the provided configuration.

Types

type Fetcher

type Fetcher struct {
	// contains filtered or unexported fields
}

Fetcher handles making HTTP requests with configured retry logic, using an underlying http.Client.

func NewFetcher

func NewFetcher(client *http.Client, cfg *config.AppConfig, log *logrus.Entry) *Fetcher

NewFetcher creates a new Fetcher instance.

func (*Fetcher) FetchWithRetry

func (f *Fetcher) FetchWithRetry(req *http.Request, ctx context.Context) (*http.Response, error)

FetchWithRetry performs an HTTP request associated with the provided context. It implements a retry mechanism with exponential backoff and jitter for transient network errors and specific HTTP status codes (5xx, 429).

type HTTPFetcher added in v1.3.0

type HTTPFetcher interface {
	FetchWithRetry(req *http.Request, ctx context.Context) (*http.Response, error)
}

HTTPFetcher is the interface for performing HTTP requests with retry logic.

type HostSemaphorePool

type HostSemaphorePool struct {
	// contains filtered or unexported fields
}

HostSemaphorePool manages per-host semaphores for rate limiting concurrent requests to each host. A single pool should be shared across all components (crawler, image processor) so the per-host limit is enforced globally.

func NewHostSemaphorePool

func NewHostSemaphorePool(maxPerHost int, log *logrus.Entry) *HostSemaphorePool

NewHostSemaphorePool creates a new pool with the given per-host concurrency limit.

func (*HostSemaphorePool) Acquire added in v1.3.0

func (p *HostSemaphorePool) Acquire(ctx context.Context, host string) error

Acquire gets or creates a host semaphore and acquires one permit. Blocks until the permit is available or ctx is cancelled.

func (*HostSemaphorePool) Len added in v1.3.0

func (p *HostSemaphorePool) Len() int

Len returns the current number of tracked hosts.

func (*HostSemaphorePool) Release added in v1.3.0

func (p *HostSemaphorePool) Release(host string)

Release releases one permit for the given host.

func (*HostSemaphorePool) RunEviction added in v1.3.0

func (p *HostSemaphorePool) RunEviction(ctx context.Context, interval time.Duration)

RunEviction periodically removes idle host entries. Should be run in a goroutine.
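The Acquire/Release pairing above can be sketched with buffered channels as per-host semaphores. This is a stdlib approximation of the pattern, not the package's implementation (which also takes a context in Acquire and evicts idle hosts); the type and method names are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// hostPool keeps one lazily created semaphore (a buffered channel) per host,
// so at most max requests run concurrently against any single host.
type hostPool struct {
	mu    sync.Mutex
	max   int
	hosts map[string]chan struct{}
}

func newHostPool(max int) *hostPool {
	return &hostPool{max: max, hosts: make(map[string]chan struct{})}
}

// sem returns the host's semaphore, creating it on first use.
func (p *hostPool) sem(host string) chan struct{} {
	p.mu.Lock()
	defer p.mu.Unlock()
	s, ok := p.hosts[host]
	if !ok {
		s = make(chan struct{}, p.max)
		p.hosts[host] = s
	}
	return s
}

// tryAcquire is a non-blocking variant; the real Acquire blocks until a
// permit frees up or the context is cancelled.
func (p *hostPool) tryAcquire(host string) bool {
	select {
	case p.sem(host) <- struct{}{}:
		return true
	default:
		return false
	}
}

// release returns one permit for the host.
func (p *hostPool) release(host string) { <-p.sem(host) }

func main() {
	p := newHostPool(2)
	fmt.Println(p.tryAcquire("a.example"), p.tryAcquire("a.example"), p.tryAcquire("a.example"))
}
```

Sharing one pool between the crawler and the image processor, as the type's doc comment advises, makes the channel capacity a genuinely global per-host ceiling.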

type RateLimiter

type RateLimiter struct {
	// contains filtered or unexported fields
}

RateLimiter manages request timing per host for politeness.

func NewRateLimiter

func NewRateLimiter(defaultDelay time.Duration, log *logrus.Entry) *RateLimiter

NewRateLimiter creates a new RateLimiter.

func (*RateLimiter) ApplyDelay

func (rl *RateLimiter) ApplyDelay(ctx context.Context, host string, minDelay time.Duration)

ApplyDelay sleeps if the time since the last request to the host is less than minDelay. It includes jitter (+/- 10%) to desynchronize requests.

func (*RateLimiter) UpdateLastRequestTime

func (rl *RateLimiter) UpdateLastRequestTime(host string)

UpdateLastRequestTime records the current time as the last request attempt time for the host. Call this *after* an HTTP request attempt to the host.

type RobotsHandler

type RobotsHandler struct {
	// contains filtered or unexported fields
}

RobotsHandler manages fetching, parsing, caching, and checking robots.txt data.

func NewRobotsHandler

func NewRobotsHandler(
	fetcher HTTPFetcher,
	rateLimiter *RateLimiter,
	globalSemaphore *semaphore.Weighted,
	sitemapNotifier SitemapDiscoverer,
	cfg *config.AppConfig,
	log *logrus.Entry,
) *RobotsHandler

NewRobotsHandler creates a RobotsHandler.

func (*RobotsHandler) GetRobotsData

func (rh *RobotsHandler) GetRobotsData(targetURL *url.URL, signalChan chan<- bool, ctx context.Context) *robotstxt.RobotsData

GetRobotsData retrieves robots.txt data for the targetURL's host, using the cache or fetching as needed. It returns the parsed data, or nil on any error, 4xx response, or missing file. signalChan is used only to coordinate the initial crawler startup fetch.

func (*RobotsHandler) TestAgent

func (rh *RobotsHandler) TestAgent(targetURL *url.URL, userAgent string, ctx context.Context) bool

TestAgent checks whether the user agent is allowed access, based on cached or freshly fetched rules. It returns true if access is allowed (or if the robots.txt fetch or parse fails), false otherwise.
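The docs describe a fail-open policy: when robots.txt is missing, returns 4xx, or cannot be parsed, the crawl is allowed and only successfully parsed rules can deny access. That decision can be sketched without reimplementing the rule matching (which robotstxt.RobotsData.TestAgent handles in the real type); `allowed` and `ruleAllows` are hypothetical names:

```go
package main

import "fmt"

// allowed applies the fail-open policy: a 2xx response means the parsed
// rules decide (ruleAllows stands in for the robots.txt matcher); any other
// outcome (4xx, missing file, fetch/parse error) permits the crawl.
func allowed(status int, ruleAllows func() bool) bool {
	if status >= 200 && status < 300 {
		return ruleAllows()
	}
	return true // no usable robots.txt: allow by default
}

func main() {
	fmt.Println(allowed(404, nil))                          // missing file: allowed
	fmt.Println(allowed(200, func() bool { return false })) // rules deny: blocked
}
```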

type SitemapDiscoverer

type SitemapDiscoverer interface {
	FoundSitemap(sitemapURL string)
}

SitemapDiscoverer defines the callback interface for handling discovered sitemap URLs.
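Since the interface is a single method, implementing it is one line of plumbing. A minimal sketch of a consumer that collects the URLs it is notified about (the collector type is an assumption, not part of the package):

```go
package main

import "fmt"

// SitemapDiscoverer mirrors the package's callback interface: RobotsHandler
// calls FoundSitemap for each Sitemap: directive it encounters.
type SitemapDiscoverer interface {
	FoundSitemap(sitemapURL string)
}

// sitemapCollector records every discovered sitemap URL, e.g. to hand off
// to a sitemap parser later.
type sitemapCollector struct{ urls []string }

func (c *sitemapCollector) FoundSitemap(u string) { c.urls = append(c.urls, u) }

func main() {
	c := &sitemapCollector{}
	var d SitemapDiscoverer = c // confirm the collector satisfies the interface
	d.FoundSitemap("https://example.com/sitemap.xml")
	fmt.Println(c.urls)
}
```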
