crawler

package
v1.0.3
Published: Dec 30, 2025 License: MIT Imports: 17 Imported by: 0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type AntiCrawl

type AntiCrawl struct {
	// contains filtered or unexported fields
}

AntiCrawl is the anti-crawl strategy manager. It encapsulates anti-crawl mechanisms such as rate limiting, randomized request headers, and delay control.

func NewAntiCrawl

func NewAntiCrawl(cfg *config.SteamConfig) *AntiCrawl

NewAntiCrawl creates an anti-crawl strategy instance.

func (*AntiCrawl) Apply

func (a *AntiCrawl) Apply(c *colly.Collector)

Apply applies the anti-crawl rules to a Colly collector instance.

type Parser

type Parser struct{}

Parser is the HTML parser. It encapsulates goquery parsing logic and provides general-purpose HTML parsing and text cleaning.

func NewParser

func NewParser() *Parser

NewParser creates a Parser instance.

func (*Parser) CleanText

func (p *Parser) CleanText(text string) string

CleanText cleans parsed text by removing leading and trailing spaces and newline characters, which improves readability.

Parameters:

  • text: Raw parsed text

Returns:

  • string: Cleaned text
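
The cleaning step described above can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: cleanText is a hypothetical stand-in, and it assumes that trimming all leading and trailing whitespace (which strings.TrimSpace does) matches the documented behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// cleanText is a hypothetical stand-in for Parser.CleanText.
// Assumption: "leading and trailing spaces and newlines" is covered by
// strings.TrimSpace, which also trims tabs and other Unicode whitespace.
func cleanText(text string) string {
	return strings.TrimSpace(text)
}

func main() {
	raw := "  \n Counter-Strike 2 \n"
	// Interior whitespace is left untouched; only the ends are trimmed.
	fmt.Printf("%q\n", cleanText(raw)) // prints "Counter-Strike 2"
}
```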

func (*Parser) ParseHTML

func (p *Parser) ParseHTML(html []byte, parseFn func(doc *goquery.Document) error) error

ParseHTML parses a raw HTML byte stream. It accepts a custom parse function, so it can adapt to different page structures.

Parameters:

  • html: Raw HTML byte stream
  • parseFn: Custom parse function (receives the goquery document instance)

Returns:

  • error: Error if parsing fails

type ProxyRotator

type ProxyRotator struct {
	// contains filtered or unexported fields
}

ProxyRotator is the proxy rotation manager. It supports two proxy selection strategies, round-robin and random, and provides dynamic proxy pool management.

func NewProxyRotator

func NewProxyRotator(cfg *config.SteamConfig) *ProxyRotator

NewProxyRotator creates a proxy rotation instance.

Parameters:

  • cfg: Global configuration instance

Returns:

  • *ProxyRotator: Proxy rotation instance

func (*ProxyRotator) AddProxy

func (p *ProxyRotator) AddProxy(proxyAddr string)

AddProxy adds a proxy to the pool at runtime.

Parameters:

  • proxyAddr: Proxy address

func (*ProxyRotator) GetProxyFunc

func (p *ProxyRotator) GetProxyFunc() func(r *http.Request) (*url.URL, error)

GetProxyFunc returns a Colly-compatible ProxyFunc. The returned function selects a proxy automatically on each request and supports a fallback on failure.

Returns:

  • func(r *http.Request) (*url.URL, error): Colly proxy function

func (*ProxyRotator) Pool

func (p *ProxyRotator) Pool() []string

Pool returns the current proxy pool.

Returns:

  • []string: Proxy address list

func (*ProxyRotator) RemoveProxy

func (p *ProxyRotator) RemoveProxy(proxyAddr string)

RemoveProxy removes a proxy from the pool at runtime.

Parameters:

  • proxyAddr: Proxy address
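
To make the rotation behavior concrete, here is a minimal sketch of a proxy pool with round-robin and random selection that returns a function of the same shape as GetProxyFunc. Everything in it is an assumption for illustration: the rotator type, its fields, and its add, remove, and proxyFunc methods are hypothetical, since ProxyRotator's fields are unexported. The fallback shown (a direct connection when the pool is empty) is one plausible reading of "fallback on failure", not confirmed behavior.

```go
package main

import (
	"fmt"
	"math/rand"
	"net/http"
	"net/url"
	"sync"
)

// rotator is a hypothetical stand-in for ProxyRotator.
type rotator struct {
	mu     sync.Mutex
	pool   []string
	next   int  // round-robin cursor
	random bool // false: round-robin, true: random
}

// add appends a proxy address to the pool.
func (r *rotator) add(addr string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.pool = append(r.pool, addr)
}

// remove drops every occurrence of addr from the pool.
func (r *rotator) remove(addr string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	kept := r.pool[:0]
	for _, p := range r.pool {
		if p != addr {
			kept = append(kept, p)
		}
	}
	r.pool = kept
}

// proxyFunc returns a function with the same signature as
// http.Transport.Proxy. It picks one proxy per call.
func (r *rotator) proxyFunc() func(*http.Request) (*url.URL, error) {
	return func(_ *http.Request) (*url.URL, error) {
		r.mu.Lock()
		defer r.mu.Unlock()
		if len(r.pool) == 0 {
			// Assumed fallback: a nil URL with a nil error means "no proxy".
			return nil, nil
		}
		var addr string
		if r.random {
			addr = r.pool[rand.Intn(len(r.pool))]
		} else {
			addr = r.pool[r.next%len(r.pool)]
			r.next++
		}
		return url.Parse(addr)
	}
}

func main() {
	r := &rotator{}
	r.add("http://127.0.0.1:8080")
	r.add("http://127.0.0.1:8081")
	pick := r.proxyFunc()
	for i := 0; i < 3; i++ {
		u, err := pick(nil) // the request is unused in this sketch
		fmt.Println(u, err)
	}
}
```

The mutex matters because a collector may issue requests concurrently, so selection and pool updates can race without it. A function with this signature is what Colly's Collector.SetProxyFunc takes, which is presumably how the value returned by GetProxyFunc is meant to be used.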

type Storage

type Storage struct {
	// contains filtered or unexported fields
}

Storage is the HTML storage manager. It stores crawled HTML files in per-date directories to keep file organization consistent.

func NewStorage

func NewStorage(baseDir string) *Storage

NewStorage creates a storage manager instance.

Parameters:

  • baseDir: Base storage directory

Returns:

  • *Storage: Storage manager instance

func (*Storage) SaveHTML

func (s *Storage) SaveHTML(filename string, html []byte) (string, error)

SaveHTML saves raw HTML to a local file. Files are stored in per-date directories, and missing directories are created automatically.

Parameters:

  • filename: Custom filename (e.g. game_550.html)
  • html: Raw HTML byte stream

Returns:

  • string: Full storage path
  • error: Error if the save fails
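
The date-partitioned layout can be sketched as follows. This is an illustration under stated assumptions rather than the package's code: saveHTML is a hypothetical stand-in for SaveHTML, and the directory name format (2006-01-02, one directory per day) and the file permissions are guesses, since the documentation does not specify them.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// saveHTML is a hypothetical stand-in for Storage.SaveHTML.
// Assumed layout: baseDir/<YYYY-MM-DD>/<filename>.
func saveHTML(baseDir, filename string, html []byte) (string, error) {
	dir := filepath.Join(baseDir, time.Now().Format("2006-01-02"))
	// MkdirAll creates any missing parents and succeeds if dir already exists.
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return "", fmt.Errorf("create directory %s: %w", dir, err)
	}
	path := filepath.Join(dir, filename)
	if err := os.WriteFile(path, html, 0o644); err != nil {
		return "", fmt.Errorf("write %s: %w", path, err)
	}
	return path, nil
}

func main() {
	base := filepath.Join(os.TempDir(), "html_out")
	path, err := saveHTML(base, "game_550.html", []byte("<html></html>"))
	if err != nil {
		fmt.Println("save failed:", err)
		return
	}
	fmt.Println("saved to", path)
}
```

Returning the full path lets the caller log or index the saved file without recomputing the date directory.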
