spider

package
v1.4.0 Latest
Warning

This package is not in the latest version of its module.

Published: Mar 3, 2026 License: Apache-2.0 Imports: 31 Imported by: 0

Documentation

Overview

Package spider provides spider rule definition, species registration, and parsing.

Constants

const (
	KEYIN       = util.USE_KEYIN // rules that use Spider.Keyin must set its initial value to USE_KEYIN
	LIMIT       = math.MaxInt64  // rules that customize Limit must set its initial value to LIMIT
	FORCED_STOP = "-- Forced stop of Spider --"
)
const (
	A = iota // alarm mode
	T        // countdown mode
)

Variables

var ErrForcedStop = errors.New("forced stop")
var Species = &SpiderSpecies{
	list: []*Spider{},
	hash: map[string]*Spider{},
}

Species is the singleton spider registry.

Functions

func PutContext

func PutContext(ctx *Context)

PutContext resets a Context and returns it to the pool.

func RegisterDynamicSpiders added in v1.4.0

func RegisterDynamicSpiders()

RegisterDynamicSpiders loads and registers all dynamic (JS-based) spider rules from config.Conf().SpiderDir. Safe to call multiple times; only the first call performs registration.
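
The "only the first call performs registration" behavior is the usual sync.Once pattern. The following is a minimal sketch of that pattern, not the package's actual code; registerOnce, registered, and registerDynamic are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// registerOnce guards the one-time registration (hypothetical name).
var registerOnce sync.Once

// registered counts how many times the loader body actually ran.
var registered int

// registerDynamic is a stand-in for RegisterDynamicSpiders: the body runs
// on the first call only; later calls return without doing anything.
func registerDynamic() {
	registerOnce.Do(func() {
		registered++ // a real loader would read rule files here
	})
}

func main() {
	for i := 0; i < 3; i++ {
		registerDynamic()
	}
	fmt.Println("loader runs:", registered)
}
```

sync.Once also makes concurrent first calls safe: callers that lose the race block until the first registration has finished.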

Types

type Bell

type Bell struct {
	Hour int
	Min  int
	Sec  int
}

Bell specifies a time-of-day for alarm mode.

type Clock

type Clock struct {
	// contains filtered or unexported fields
}

Clock represents a single alarm or countdown timer.

type Context

type Context struct {
	Request  *request.Request
	Response *http.Response // URL is copied from *request.Request

	sync.Mutex
	// contains filtered or unexported fields
}

Context carries the state for a single crawl request through its lifecycle.

func GetContext

func GetContext(sp *Spider, req *request.Request) *Context

GetContext retrieves a Context from the pool and binds it to the given spider and request.
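
GetContext and PutContext form a get/reset/put cycle around an object pool. The sketch below shows that cycle with sync.Pool and a local stand-in type; it assumes nothing about how the package implements its pool, and crawlCtx, ctxPool, getCtx, and putCtx are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// crawlCtx is a local stand-in for Context with just two fields.
type crawlCtx struct {
	spider string
	url    string
}

// ctxPool hands out reusable crawlCtx values.
var ctxPool = sync.Pool{
	New: func() interface{} { return new(crawlCtx) },
}

// getCtx takes a crawlCtx from the pool and binds it to a spider and URL.
func getCtx(spider, url string) *crawlCtx {
	c := ctxPool.Get().(*crawlCtx)
	c.spider, c.url = spider, url
	return c
}

// putCtx clears the value so no state leaks to the next user,
// then returns it to the pool.
func putCtx(c *crawlCtx) {
	*c = crawlCtx{}
	ctxPool.Put(c)
}

func main() {
	c := getCtx("demo", "http://example.com/")
	fmt.Println(c.spider, c.url)
	putCtx(c) // c must not be used after this point
}
```

The practical consequence for rule authors is that a Context should not be retained after it has been handed back with PutContext.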

func (*Context) AddQueue

func (ctx *Context) AddQueue(req *request.Request) *Context

AddQueue validates and enqueues a new crawl request.

Required fields: Request.URL, Request.Rule. Request.Spider is set automatically; Request.EnableCookie is inherited from Spider.

Fields with defaults (may be omitted):

  • Method: GET
  • DialTimeout: request.DefaultDialTimeout (negative = unlimited)
  • ConnTimeout: request.DefaultConnTimeout (negative = unlimited)
  • TryTimes: request.DefaultTryTimes (negative = unlimited retries)
  • RedirectTimes: unlimited by default (negative = disable redirects)
  • RetryPause: request.DefaultRetryPause
  • DownloaderID: 0 = Surf (fast, full-featured), 1 = PhantomJS (slow, JS-capable)

Referer is auto-filled from the current response URL if not set.
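
The validation and default-filling described above can be sketched with a local stand-in type. This is an illustration only: crawlRequest and prepare are hypothetical, only a few of the listed fields are modeled, and assumedDefaultTryTimes is a placeholder because the real value of request.DefaultTryTimes is not shown on this page.

```go
package main

import (
	"errors"
	"fmt"
)

// crawlRequest is a local stand-in for request.Request with a few fields.
type crawlRequest struct {
	URL      string
	Rule     string
	Method   string
	Referer  string
	TryTimes int
}

// assumedDefaultTryTimes is a placeholder value, not the package's constant.
const assumedDefaultTryTimes = 3

// prepare rejects requests without URL or Rule and fills omitted fields.
func prepare(r *crawlRequest, currentURL string) error {
	if r.URL == "" || r.Rule == "" {
		return errors.New("request needs both URL and Rule")
	}
	if r.Method == "" {
		r.Method = "GET"
	}
	if r.TryTimes == 0 {
		r.TryTimes = assumedDefaultTryTimes
	}
	if r.Referer == "" {
		r.Referer = currentURL // auto-fill from the page being parsed
	}
	return nil
}

func main() {
	r := &crawlRequest{URL: "http://example.com/page/2", Rule: "list"}
	if err := prepare(r, "http://example.com/page/1"); err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Printf("%+v\n", *r)
}
```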

func (*Context) Aid

func (ctx *Context) Aid(aid map[string]interface{}, ruleName ...string) interface{}

Aid invokes the AidFunc of the specified rule. An empty ruleName defaults to the current rule.

func (*Context) CopyRequest

func (ctx *Context) CopyRequest() *request.Request

CopyRequest returns a deep copy of the original request.

func (*Context) CopyTemps

func (ctx *Context) CopyTemps() request.Temp

CopyTemps returns a shallow copy of the request's temporary data.

func (*Context) CreateItem added in v1.4.0

func (ctx *Context) CreateItem(item map[int]interface{}, ruleName ...string) map[string]interface{}

CreateItem builds a text result map keyed by field names using the ItemFields of ruleName. An empty ruleName defaults to the current rule.
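
The index-to-name mapping can be illustrated with a small standalone function. This sketch assumes that indexes with no matching field are skipped, which the documentation does not state; createItem here is a hypothetical free function, not the method itself.

```go
package main

import "fmt"

// createItem converts an index-keyed item into a name-keyed one,
// using fields[i] as the name for index i.
func createItem(fields []string, item map[int]interface{}) map[string]interface{} {
	out := make(map[string]interface{}, len(item))
	for idx, v := range item {
		if idx < 0 || idx >= len(fields) {
			continue // assumption: indexes without a field name are dropped
		}
		out[fields[idx]] = v
	}
	return out
}

func main() {
	fields := []string{"Title", "Author"}
	item := map[int]interface{}{0: "Go in Practice", 1: "someone"}
	fmt.Println(createItem(fields, item))
}
```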

func (*Context) FileOutput

func (ctx *Context) FileOutput(nameOrExt ...string)

FileOutput collects a file result from the response body. nameOrExt optionally specifies a file name or extension; if empty, the original name is kept. Errors are logged internally; the method returns nothing so that it stays compatible with the JS VM.

func (*Context) GetCookie

func (ctx *Context) GetCookie() string

GetCookie returns the Set-Cookie header from the response.

func (*Context) GetDom

func (ctx *Context) GetDom() *goquery.Document

GetDom returns the parsed HTML DOM, initializing it lazily from the response body. Errors are stored in ctx.err and can be retrieved via GetError().

func (*Context) GetError

func (ctx *Context) GetError() error

GetError returns the download error, or the spider's stop error if stopping.

func (*Context) GetHeader

func (ctx *Context) GetHeader() http.Header

GetHeader returns the response headers.

func (*Context) GetHost

func (ctx *Context) GetHost() string

GetHost returns the host from the response URL, or "" if unavailable.

func (*Context) GetItemField

func (ctx *Context) GetItemField(index int, ruleName ...string) (field string)

GetItemField returns the field name at the given index, or "" if not found. An empty ruleName defaults to the current rule.

func (*Context) GetItemFieldIndex

func (ctx *Context) GetItemFieldIndex(field string, ruleName ...string) (index int)

GetItemFieldIndex returns the index of the given field name, or -1 if not found. An empty ruleName defaults to the current rule.

func (*Context) GetItemFields

func (ctx *Context) GetItemFields(ruleName ...string) []string

GetItemFields returns the result field name list for the given rule.

func (*Context) GetKeyin

func (ctx *Context) GetKeyin() string

GetKeyin returns the custom keyword/configuration input.

func (*Context) GetLimit

func (ctx *Context) GetLimit() int

GetLimit returns the maximum number of items to crawl.

func (*Context) GetMethod

func (ctx *Context) GetMethod() string

GetMethod returns the HTTP method of the request.

func (*Context) GetName

func (ctx *Context) GetName() string

GetName returns the spider name.

func (*Context) GetReferer

func (ctx *Context) GetReferer() string

GetReferer returns the Referer header from the actual HTTP request made.

func (*Context) GetRequest

func (ctx *Context) GetRequest() *request.Request

GetRequest returns the original request.

func (*Context) GetRequestHeader

func (ctx *Context) GetRequestHeader() http.Header

GetRequestHeader returns the request headers from the actual HTTP request made.

func (*Context) GetResponse

func (ctx *Context) GetResponse() *http.Response

GetResponse returns the HTTP response.

func (*Context) GetRule

func (ctx *Context) GetRule(ruleName string) *Rule

GetRule returns the rule with the given name.

func (*Context) GetRuleName

func (ctx *Context) GetRuleName() string

GetRuleName returns the current rule name from the request.

func (*Context) GetRules

func (ctx *Context) GetRules() map[string]*Rule

GetRules returns the full rule map.

func (*Context) GetSpider

func (ctx *Context) GetSpider() *Spider

GetSpider returns the spider bound to this context.

func (*Context) GetStatusCode

func (ctx *Context) GetStatusCode() int

GetStatusCode returns the HTTP response status code, or 0 if no response.

func (*Context) GetTemp

func (ctx *Context) GetTemp(key string, defaultValue interface{}) interface{}

GetTemp retrieves temporary data from the request by key. defaultValue must not be a nil interface{}.

func (*Context) GetTemps

func (ctx *Context) GetTemps() request.Temp

GetTemps returns all temporary data from the request.

func (*Context) GetText

func (ctx *Context) GetText() string

GetText returns the response body as a UTF-8 string, initializing it lazily. Errors are stored in ctx.err and can be retrieved via GetError().

func (*Context) GetURL added in v1.4.0

func (ctx *Context) GetURL() string

GetURL returns the URL from the original request, preserving the unencoded form.

func (*Context) JsAddQueue

func (ctx *Context) JsAddQueue(jreq map[string]interface{}) *Context

JsAddQueue adds crawl requests from dynamic (JavaScript) rule definitions.

func (*Context) Log

func (*Context) Log() logs.Logs

Log returns the global logger instance.

func (*Context) Output

func (ctx *Context) Output(item interface{}, ruleName ...string)

Output collects a text result item.

When item is map[int]interface{}, fields are mapped using the existing ItemFields of ruleName. When item is map[string]interface{}, missing ItemFields are auto-added. An empty ruleName defaults to the current rule.

func (*Context) Parse

func (ctx *Context) Parse(ruleName ...string) *Context

Parse dispatches the response to the ParseFunc of the specified rule. An empty ruleName defaults to Root().
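
The dispatch between Root and the named rules in Trunk can be sketched with local mirror types. The types below copy only the shape of Rule and RuleTree; silently ignoring an unknown rule name is an assumption of this sketch, and crawlCtx, trace, and newDemoTree are hypothetical.

```go
package main

import "fmt"

// crawlCtx is a stand-in for Context; trace records which functions ran.
type crawlCtx struct {
	tree  *ruleTree
	trace []string
}

// rule and ruleTree mirror the shape of Rule and RuleTree.
type rule struct {
	ParseFunc func(*crawlCtx)
}

type ruleTree struct {
	Root  func(*crawlCtx)
	Trunk map[string]*rule
}

// parse runs Root when no rule name is given, otherwise the named ParseFunc.
func (c *crawlCtx) parse(ruleName ...string) *crawlCtx {
	if len(ruleName) == 0 || ruleName[0] == "" {
		c.tree.Root(c)
		return c
	}
	if r, ok := c.tree.Trunk[ruleName[0]]; ok && r.ParseFunc != nil {
		r.ParseFunc(c)
	}
	return c // assumption: an unknown rule name is ignored
}

// newDemoTree builds a tree whose Root hands off to the "list" rule.
func newDemoTree() *ruleTree {
	return &ruleTree{
		Root: func(c *crawlCtx) {
			c.trace = append(c.trace, "root")
			c.parse("list")
		},
		Trunk: map[string]*rule{
			"list": {ParseFunc: func(c *crawlCtx) { c.trace = append(c.trace, "list") }},
		},
	}
}

func main() {
	c := &crawlCtx{tree: newDemoTree()}
	c.parse()
	fmt.Println(c.trace)
}
```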

func (*Context) PullFiles

func (ctx *Context) PullFiles() (fs []data.FileCell)

PullFiles drains and returns all collected file results, resetting the internal buffer.

func (*Context) PullItems

func (ctx *Context) PullItems() (ds []data.DataCell)

PullItems drains and returns all collected data items, resetting the internal buffer.
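
"Drain and reset" means the caller takes ownership of everything collected so far and the buffer starts empty again. A minimal sketch of that pattern follows; collector, output, and pull are hypothetical names, and the real buffer holds data.DataCell values rather than plain maps.

```go
package main

import (
	"fmt"
	"sync"
)

// collector buffers result items until they are pulled.
type collector struct {
	mu    sync.Mutex
	items []map[string]interface{}
}

// output appends one item to the buffer.
func (c *collector) output(item map[string]interface{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items = append(c.items, item)
}

// pull returns everything buffered so far and leaves the buffer empty.
func (c *collector) pull() []map[string]interface{} {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := c.items
	c.items = nil
	return out
}

func main() {
	c := &collector{}
	c.output(map[string]interface{}{"Title": "a"})
	c.output(map[string]interface{}{"Title": "b"})
	fmt.Println(len(c.pull()), len(c.pull()))
}
```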

func (*Context) ResetText

func (ctx *Context) ResetText(body string) *Context

ResetText replaces the downloaded text content and invalidates the DOM cache.
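
Lazy initialization plus cache invalidation, as described for GetDom and ResetText, can be sketched as follows. The sketch uses strings.Fields as a cheap stand-in for HTML parsing; page, getDom, and resetText are hypothetical, and the real cache holds a *goquery.Document.

```go
package main

import (
	"fmt"
	"strings"
)

// page caches a derived view (dom) of its text.
type page struct {
	text   string
	dom    []string // stand-in for the parsed document
	parsed bool     // whether dom is valid for the current text
	parses int      // how many times parsing actually ran
}

// getDom parses the text on first use and reuses the result afterwards.
func (p *page) getDom() []string {
	if !p.parsed {
		p.dom = strings.Fields(p.text)
		p.parsed = true
		p.parses++
	}
	return p.dom
}

// resetText replaces the text and drops the cached view.
func (p *page) resetText(body string) *page {
	p.text = body
	p.dom = nil
	p.parsed = false
	return p
}

func main() {
	p := &page{text: "old body"}
	p.getDom()
	p.getDom() // served from the cache
	p.resetText("a new body")
	fmt.Println(p.getDom(), "parses:", p.parses)
}
```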

func (*Context) RunTimer

func (ctx *Context) RunTimer(id string) bool

RunTimer starts the timer and reports whether it can continue to be used.

func (*Context) SetError

func (ctx *Context) SetError(err error)

SetError marks a download error on this context.

func (*Context) SetKeyin

func (ctx *Context) SetKeyin(keyin string) *Context

SetKeyin sets the custom keyword/configuration input.

func (*Context) SetLimit

func (ctx *Context) SetLimit(max int) *Context

SetLimit sets the maximum number of items to crawl.

func (*Context) SetPausetime

func (ctx *Context) SetPausetime(pause int64, runtime ...bool) *Context

SetPausetime sets a custom pause interval; the actual pause is randomized between pause/2 and pause*2. It overrides the externally configured value. An already-set value is overwritten only when runtime[0] is true.

func (*Context) SetReferer

func (ctx *Context) SetReferer(referer string) *Context

func (*Context) SetResponse

func (ctx *Context) SetResponse(resp *http.Response) *Context

SetResponse binds the HTTP response to this context.

func (*Context) SetTemp

func (ctx *Context) SetTemp(key string, value interface{}) *Context

SetTemp stores temporary data in the current request.

func (*Context) SetTimer

func (ctx *Context) SetTimer(id string, tol time.Duration, bell *Bell) bool

SetTimer configures a timer identified by id. When bell is nil, tol is a sleep duration (countdown timer). When bell is non-nil, tol specifies the wake-up point (the tol-th bell occurrence from now).

func (*Context) SetURL added in v1.4.0

func (ctx *Context) SetURL(url string) *Context

func (*Context) UpsertItemField

func (ctx *Context) UpsertItemField(field string, ruleName ...string) (index int)

UpsertItemField adds a result field name to the given rule and returns its index. If the field already exists, the existing index is returned. An empty ruleName defaults to the current rule.
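
The lookup and upsert semantics of GetItemFieldIndex and UpsertItemField can be sketched on a plain string slice. fieldIndex and upsertField are hypothetical helpers; the real methods operate on a rule's ItemFields.

```go
package main

import "fmt"

// fieldIndex returns the position of field, or -1 if it is absent.
func fieldIndex(fields []string, field string) int {
	for i, f := range fields {
		if f == field {
			return i
		}
	}
	return -1
}

// upsertField returns the index of field, appending it first if needed.
func upsertField(fields *[]string, field string) int {
	if i := fieldIndex(*fields, field); i >= 0 {
		return i
	}
	*fields = append(*fields, field)
	return len(*fields) - 1
}

func main() {
	fields := []string{"Title"}
	fmt.Println(upsertField(&fields, "Author")) // appended at the end
	fmt.Println(upsertField(&fields, "Title"))  // already present
	fmt.Println(fieldIndex(fields, "Price"))    // absent
}
```

Because new names are only ever appended, existing indexes stay stable, which is what lets index-keyed items keep their field order.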

type Rule

type Rule struct {
	ItemFields []string                                           // result field names (optional; preserves field order)
	ParseFunc  func(*Context)                                     // content parsing function
	AidFunc    func(*Context, map[string]interface{}) interface{} // auxiliary helper function
}

Rule defines a single crawl rule node.

type RuleModle

type RuleModle struct {
	Name      string `xml:"name,attr"`
	ParseFunc string `xml:"ParseFunc>Script"`
	AidFunc   string `xml:"AidFunc>Script"`
}

RuleModle is the XML model for a single dynamic rule node.

type RuleTree

type RuleTree struct {
	Root  func(*Context)   // entry point
	Trunk map[string]*Rule // rule map (keyed by rule name)
}

RuleTree defines the crawl rule tree.

type Spider

type Spider struct {
	// User-defined fields
	Name            string                                                     // display name (must be unique)
	Description     string                                                     // display description
	Pausetime       int64                                                      // random pause range (50%~200%); if set in rule, overrides UI parameter
	Limit           int64                                                      // request limit (0 = unlimited; set to LIMIT for custom limit logic in rules)
	Keyin           string                                                     // custom input config (set to KEYIN in rules to enable)
	EnableCookie    bool                                                       // whether requests carry cookies
	NotDefaultField bool                                                       // disable default output fields Url/ParentUrl/DownloadTime
	Namespace       func(sp *Spider) string                                    // namespace for output file/path naming
	SubNamespace    func(self *Spider, dataCell map[string]interface{}) string // sub-namespace, may depend on specific data content
	RuleTree        *RuleTree                                                  // crawl rule tree
	// contains filtered or unexported fields
}

Spider defines a crawl spider with its rules and runtime state.

func (*Spider) CanStop

func (sp *Spider) CanStop() bool

CanStop reports whether the spider can transition to a stopped state.

func (*Spider) Copy

func (sp *Spider) Copy() *Spider

Copy returns a deep copy of the spider, including its rule tree.

func (*Spider) Defer

func (sp *Spider) Defer()

Defer performs cleanup before the spider exits: cancels timers, waits for in-flight requests, and flushes failures.

func (*Spider) DoHistory

func (sp *Spider) DoHistory(req *request.Request, ok bool) bool

DoHistory records request history and reports whether a failed request was re-enqueued.

func (*Spider) GetDescription

func (sp *Spider) GetDescription() string

GetDescription returns the spider description.

func (*Spider) GetEnableCookie

func (sp *Spider) GetEnableCookie() bool

GetEnableCookie reports whether requests carry cookies.

func (*Spider) GetID added in v1.4.0

func (sp *Spider) GetID() int

GetID returns the spider's queue index.

func (*Spider) GetItemField

func (sp *Spider) GetItemField(rule *Rule, index int) (field string)

GetItemField returns the field name at the given index, or "" if out of range.

func (*Spider) GetItemFieldIndex

func (sp *Spider) GetItemFieldIndex(rule *Rule, field string) (index int)

GetItemFieldIndex returns the index of the given field name, or -1 if not found.

func (*Spider) GetItemFields

func (sp *Spider) GetItemFields(rule *Rule) []string

GetItemFields returns the result field names for the given rule.

func (*Spider) GetKeyin

func (sp *Spider) GetKeyin() string

GetKeyin returns the custom keyword/configuration input.

func (*Spider) GetLimit

func (sp *Spider) GetLimit() int64

GetLimit returns the crawl limit. Negative means request-count limiting; positive means custom rule-based limiting.

func (*Spider) GetName

func (sp *Spider) GetName() string

GetName returns the spider name.

func (*Spider) GetRule

func (sp *Spider) GetRule(ruleName string) *Rule

GetRule returns the rule with the given name.

func (*Spider) GetRules

func (sp *Spider) GetRules() map[string]*Rule

GetRules returns the full rule map.

func (*Spider) GetSubName

func (sp *Spider) GetSubName() string

GetSubName returns the secondary identifier derived from Keyin (computed once).

func (*Spider) IsStopping

func (sp *Spider) IsStopping() bool

IsStopping reports whether the spider is in the process of stopping.

func (*Spider) MustGetRule

func (sp *Spider) MustGetRule(ruleName string) *Rule

MustGetRule returns the rule with the given name (panics if missing).

func (*Spider) OutDefaultField

func (sp *Spider) OutDefaultField() bool

OutDefaultField reports whether default fields (Url/ParentUrl/DownloadTime) should be included in output.

func (*Spider) Register

func (sp *Spider) Register() *Spider

Register adds this spider to the global species list.

func (*Spider) ReqmatrixInit

func (sp *Spider) ReqmatrixInit() *Spider

ReqmatrixInit initializes the request scheduling matrix for this spider.

func (*Spider) RequestFree

func (sp *Spider) RequestFree()

func (*Spider) RequestLen

func (sp *Spider) RequestLen() int

func (*Spider) RequestPull

func (sp *Spider) RequestPull() *request.Request

RequestPull dequeues the next request from the scheduling matrix.

func (*Spider) RequestPush

func (sp *Spider) RequestPush(req *request.Request)

RequestPush enqueues a request into the scheduling matrix.

func (*Spider) RequestUse

func (sp *Spider) RequestUse()

func (*Spider) RunTimer

func (sp *Spider) RunTimer(id string) bool

RunTimer starts the timer and reports whether it can continue to be used.

func (*Spider) SetID added in v1.4.0

func (sp *Spider) SetID(id int)

SetID assigns the spider's queue index.

func (*Spider) SetKeyin

func (sp *Spider) SetKeyin(keyword string)

SetKeyin sets the custom keyword/configuration input.

func (*Spider) SetLimit

func (sp *Spider) SetLimit(max int64)

SetLimit sets the crawl limit.

func (*Spider) SetPausetime

func (sp *Spider) SetPausetime(pause int64, runtime ...bool)

SetPausetime sets a custom pause interval. Only overwrites an existing value when runtime[0] is true.

func (*Spider) SetTimer

func (sp *Spider) SetTimer(id string, tol time.Duration, bell *Bell) bool

SetTimer configures a timer identified by id. When bell is nil, tol is a countdown sleep duration; otherwise tol specifies the wake-up occurrence.

func (*Spider) Start

func (sp *Spider) Start()

Start executes the spider's root rule.

func (*Spider) Stop

func (sp *Spider) Stop()

Stop gracefully stops the spider and cancels all timers.

func (*Spider) TryFlushFailure

func (sp *Spider) TryFlushFailure()

func (*Spider) TryFlushSuccess

func (sp *Spider) TryFlushSuccess()

func (*Spider) UpsertItemField

func (sp *Spider) UpsertItemField(rule *Rule, field string) (index int)

UpsertItemField appends a result field name to the rule and returns its index. If the field already exists, the existing index is returned.

type SpiderModle

type SpiderModle struct {
	Name            string      `xml:"Name"`
	Description     string      `xml:"Description"`
	Pausetime       int64       `xml:"Pausetime"`
	EnableLimit     bool        `xml:"EnableLimit"`
	EnableKeyin     bool        `xml:"EnableKeyin"`
	EnableCookie    bool        `xml:"EnableCookie"`
	NotDefaultField bool        `xml:"NotDefaultField"`
	Namespace       string      `xml:"Namespace>Script"`
	SubNamespace    string      `xml:"SubNamespace>Script"`
	Root            string      `xml:"Root>Script"`
	Trunk           []RuleModle `xml:"Rule"`
}

SpiderModle is the XML model for dynamic (JavaScript-based) spider rules.

type SpiderSpecies

type SpiderSpecies struct {
	// contains filtered or unexported fields
}

SpiderSpecies is the global registry of available spider types.

func (*SpiderSpecies) Add

func (ss *SpiderSpecies) Add(sp *Spider) *Spider

Add registers a spider. If the name already exists, a numeric suffix is appended.
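
Name de-duplication with a numeric suffix can be sketched as follows. The suffix format "(2)", "(3)" is an assumption made for this example; the documentation says only that a numeric suffix is appended. registry and add are hypothetical and track names instead of *Spider values.

```go
package main

import "fmt"

// registry keeps names in insertion order plus a set for fast lookup.
type registry struct {
	list []string
	hash map[string]bool
}

// add stores name, appending a numeric suffix if it is already taken,
// and returns the name that was actually registered.
func (r *registry) add(name string) string {
	final := name
	for i := 2; r.hash[final]; i++ {
		final = fmt.Sprintf("%s(%d)", name, i) // assumed suffix format
	}
	r.hash[final] = true
	r.list = append(r.list, final)
	return final
}

func main() {
	r := &registry{hash: map[string]bool{}}
	fmt.Println(r.add("news"), r.add("news"), r.add("news"))
}
```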

func (*SpiderSpecies) Get

func (ss *SpiderSpecies) Get() []*Spider

Get returns all registered spiders, sorted by pinyin initials on first call. Dynamic spiders are lazily registered on first access.

func (*SpiderSpecies) GetByNameOpt added in v1.4.0

func (ss *SpiderSpecies) GetByNameOpt(name string) option.Option[*Spider]

GetByNameOpt returns the spider with the given name, wrapped in an option.Option.
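
Returning an option instead of a possibly nil pointer makes the "not found" case explicit at the call site. The sketch below defines its own tiny generic Option type to show the idea; it is not the API of the option package this module imports, and Some, None, Get, and getByNameOpt are hypothetical. It requires Go 1.18 or later.

```go
package main

import "fmt"

// Option is a minimal stand-in for an optional value.
type Option[T any] struct {
	value T
	ok    bool
}

// Some wraps a present value; None represents absence.
func Some[T any](v T) Option[T] { return Option[T]{value: v, ok: true} }
func None[T any]() Option[T]    { return Option[T]{} }

// Get returns the value and whether it is present.
func (o Option[T]) Get() (T, bool) { return o.value, o.ok }

// spider is a stand-in for Spider.
type spider struct{ Name string }

// getByNameOpt looks a spider up by name without ever returning nil.
func getByNameOpt(hash map[string]*spider, name string) Option[*spider] {
	if sp, ok := hash[name]; ok {
		return Some(sp)
	}
	return None[*spider]()
}

func main() {
	hash := map[string]*spider{"news": {Name: "news"}}
	if sp, ok := getByNameOpt(hash, "news").Get(); ok {
		fmt.Println("found", sp.Name)
	}
	if _, ok := getByNameOpt(hash, "blog").Get(); !ok {
		fmt.Println("blog is not registered")
	}
}
```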

type Timer

type Timer struct {
	sync.RWMutex
	// contains filtered or unexported fields
}

Timer manages a collection of named clocks (countdown timers or alarms).

Directories

Path Synopsis
Package common provides HTML cleaning, form parsing, and other utility functions for spider rules.