crawl

package
v0.1.1
Warning: this package is not in the latest version of its module.
Published: Mar 10, 2026 · License: MIT · Imports: 11 · Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func FindNextPageURL

func FindNextPageURL(html []byte, baseURL string, rule *strategy.PaginationRule) (string, error)

FindNextPageURL extracts the next page URL from the current page using the pagination rule.
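
Example (illustrative sketch): the module's import paths are not shown on this page, so the paths below are assumptions, and PaginationRule's fields are likewise undocumented here, so a zero-value rule stands in as a placeholder.

package main

import (
	"fmt"
	"log"

	"example.com/crawler/crawl"    // assumed import path
	"example.com/crawler/strategy" // assumed import path
)

func main() {
	// HTML of the page currently being crawled.
	html := []byte(`<html><a rel="next" href="/page/2">Next</a></html>`)

	// PaginationRule's fields are not documented on this page; a zero
	// value is used purely as a placeholder.
	var rule strategy.PaginationRule

	next, err := crawl.FindNextPageURL(html, "https://example.com/page/1", &rule)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("next page:", next)
}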

Types

type Options

type Options struct {
	URL        string
	FetchOpts  fetch.Options
	Strategy   *strategy.ExtractionStrategy // pre-loaded strategy (nil = derive)
	FieldNames []string
	FieldDescs map[string]string
	Query      string // natural language query (--query mode)
	Provider   string
	Model      string
	APIKey     string
	MaxPages   int
	NoCache    bool
	NoHeal     bool
	Verbose    bool
}

Options configures a crawl run.

type Result

type Result struct {
	Strategy *strategy.ExtractionStrategy
	Extract  *extract.Result
	Pages    int
}

Result holds the output of a full crawl run.

func Run

func Run(ctx context.Context, opts Options) (*Result, error)

Run executes the full crawl pipeline: fetch, analyze, derive/load strategy, extract.
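
Example (illustrative sketch): a full crawl run built from the documented Options fields only. Import path, provider, model, and key values are assumptions; Strategy is left nil so Run derives one, per the field comment above.

package main

import (
	"context"
	"fmt"
	"log"

	"example.com/crawler/crawl" // assumed import path
)

func main() {
	opts := crawl.Options{
		URL:        "https://example.com/listings",
		FieldNames: []string{"title", "price"},
		FieldDescs: map[string]string{"price": "listing price, including currency"},
		Provider:   "openai", // provider/model/key values are illustrative
		Model:      "gpt-4o-mini",
		APIKey:     "sk-...",
		MaxPages:   3,
		// Strategy is left nil so Run derives one.
	}

	res, err := crawl.Run(context.Background(), opts)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("crawled %d page(s)\n", res.Pages)
	// res.Strategy holds the strategy that was used or derived; res.Extract
	// holds the extracted data (see the extract package for its fields).
}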

type WorkerPool

type WorkerPool struct {
	Concurrency int
	FetchOpts   fetch.Options
	Strategy    *strategy.ExtractionStrategy
	// contains filtered or unexported fields
}

WorkerPool manages concurrent page fetching and extraction.

func (*WorkerPool) ProcessURLs

func (wp *WorkerPool) ProcessURLs(urls []string) ([]*extract.Result, []error)

ProcessURLs fetches and extracts data from multiple URLs concurrently.
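
Example (illustrative sketch): derive a strategy from one page via Run, then reuse it across many URLs concurrently. The import path is an assumption, and since WorkerPool contains unexported fields, a bare struct literal may not be the intended construction pattern; treat this as a sketch, not an established usage.

package main

import (
	"context"
	"fmt"
	"log"

	"example.com/crawler/crawl" // assumed import path
)

func main() {
	// Derive a strategy from one page first, then reuse it across many URLs.
	res, err := crawl.Run(context.Background(), crawl.Options{
		URL:        "https://example.com/page/1",
		FieldNames: []string{"title", "price"},
	})
	if err != nil {
		log.Fatal(err)
	}

	// WorkerPool has unexported fields, so a bare literal may not fully
	// initialize it; this construction is an assumption.
	wp := &crawl.WorkerPool{
		Concurrency: 4,
		Strategy:    res.Strategy,
	}

	results, errs := wp.ProcessURLs([]string{
		"https://example.com/page/2",
		"https://example.com/page/3",
	})
	for _, err := range errs {
		log.Println("page failed:", err)
	}
	fmt.Printf("extracted %d result(s)\n", len(results))
}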
