crawler

package
v0.0.0-...-9e1f9fd
This package is not in the latest version of its module.
Published: Jun 4, 2025 License: MIT Imports: 13 Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func Collector

func Collector(ctx context.Context, url string, projectPath string, cookieJar *cookiejar.Jar, proxyString string, userAgent string) error

Collector searches for CSS, JS, and images within a given link. TODO: improve for better performance.

func CollectorWithSizeLimit

func CollectorWithSizeLimit(ctx context.Context, url string, projectPath string, cookieJar *cookiejar.Jar, proxyString string, userAgent string, maxFolderSize int64) error

CollectorWithSizeLimit is a collector with a folder-size limit.

func Crawl

func Crawl(ctx context.Context, site string, projectPath string, cookieJar *cookiejar.Jar, proxyString string, userAgent string) error

Crawl dispatches the necessary collectors to gather the links needed to rebuild the web page.

func CrawlWithConfig

func CrawlWithConfig(ctx context.Context, site string, projectPath string, cookieJar *cookiejar.Jar, config CrawlConfig) error

CrawlWithConfig crawls using a configuration object and supports folder-size checks.

func Extractor

func Extractor(link string, projectPath string)

Extractor visits a link, determines whether it is a page or a sublink, and downloads the contents to the correct directory in the project folder. TODO: add functionality for determining whether a link is a page or a sublink.
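One common way to make that page-versus-sublink decision is by the extension of the URL path. A sketch of such a classifier; the function isPage is hypothetical, not part of this package:

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// isPage reports whether a link looks like an HTML page rather than
// an asset sublink (CSS, JS, image), judged by the URL path's extension.
func isPage(link string) bool {
	u, err := url.Parse(link)
	if err != nil {
		return false
	}
	switch strings.ToLower(path.Ext(u.Path)) {
	case "", ".html", ".htm", ".php", ".asp", ".aspx":
		return true // no extension or a page-like one: treat as a page
	default:
		return false // .css, .js, .png, etc.: treat as an asset sublink
	}
}

func main() {
	fmt.Println(isPage("https://example.com/about"))        // true
	fmt.Println(isPage("https://example.com/css/site.css")) // false
	fmt.Println(isPage("https://example.com/img/logo.png")) // false
}
```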

func HTMLExtractor

func HTMLExtractor(link string, projectPath string)

HTMLExtractor downloads the HTML content of a link into the project folder.

func HTMLExtractorFromResponse

func HTMLExtractorFromResponse(link string, projectPath string, bodyData []byte)

HTMLExtractorFromResponse extracts HTML content from a colly response.

Types

type CrawlConfig

type CrawlConfig interface {
	GetProxyString() string
	GetUserAgent() string
	GetMaxFolderSize() int64
}

CrawlConfig is the crawl configuration interface.
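Because CrawlConfig is an interface, any type with these three getters satisfies it. A minimal sketch, where the staticConfig struct is an assumption for illustration (the interface declaration is copied from the docs above):

```go
package main

import "fmt"

// CrawlConfig mirrors the interface declared by this package.
type CrawlConfig interface {
	GetProxyString() string
	GetUserAgent() string
	GetMaxFolderSize() int64
}

// staticConfig is a hypothetical value type satisfying CrawlConfig.
type staticConfig struct {
	proxy   string
	agent   string
	maxSize int64
}

func (c staticConfig) GetProxyString() string  { return c.proxy }
func (c staticConfig) GetUserAgent() string    { return c.agent }
func (c staticConfig) GetMaxFolderSize() int64 { return c.maxSize }

func main() {
	var cfg CrawlConfig = staticConfig{
		proxy:   "",               // empty string: no proxy
		agent:   "my-crawler/1.0", // assumed UA string
		maxSize: 50 * 1024 * 1024, // 50 MiB cap
	}
	fmt.Println(cfg.GetUserAgent(), cfg.GetMaxFolderSize()) // prints "my-crawler/1.0 52428800"
}
```

A value like this would then be passed to CrawlWithConfig in place of the separate proxy, user-agent, and size-limit arguments.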
