strategy

package
v0.1.1 Latest
Warning

This package is not in the latest version of its module.

Published: Mar 10, 2026 License: MIT Imports: 14 Imported by: 0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func SaveCache

func SaveCache(s *ExtractionStrategy) error

SaveCache writes a strategy to the cache directory.

func ValidateAgainstPage

func ValidateAgainstPage(s *ExtractionStrategy, html []byte) (int, []string, error)

ValidateAgainstPage checks whether a strategy's selectors match anything in the given HTML. It returns the number of items found and a list of any issues detected.

Types

type CandidateRegion

type CandidateRegion struct {
	Selector       string
	ItemCount      int
	Context        string // nearby heading text
	SectionID      string // nearest ancestor element ID (e.g. "market-share")
	Sample         string // first ~2000 chars of region HTML
	ItemSelector   string // CSS selector for a single repeating item
	SingleItemHTML string // HTML of one single item (for the LLM to examine)
}

CandidateRegion is a detected repeating data region on the page.

type DeriveRequest

type DeriveRequest struct {
	SimplifiedHTML   string
	URL              string
	FieldNames       []string
	FieldDescs       map[string]string // optional field descriptions
	Query            string            // natural language query (--query mode)
	Model            string
	Provider         string
	APIKey           string
	CandidateRegions []CandidateRegion // pre-detected repeating patterns (optional)
	RawHTML          []byte            // raw HTML for validation-retry loop (optional)
}

DeriveRequest holds the inputs for strategy derivation.

type ExtractionStrategy

type ExtractionStrategy struct {
	SitePattern       string          `json:"site_pattern"`
	ContainerSelector string          `json:"container_selector,omitempty"` // scopes item_selector to a specific page section
	ItemSelector      string          `json:"item_selector"`
	Fields            []FieldMapping  `json:"fields"`
	Pagination        *PaginationRule `json:"pagination,omitempty"`
	Confidence        float64         `json:"confidence"`
	Fingerprint       string          `json:"fingerprint"`
}

ExtractionStrategy is the LLM-derived plan for extracting data from a page.
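As a rough illustration of the JSON shape implied by the struct tags above, the following sketch decodes a hand-written strategy document. The types are local mirrors of the documented structs so the example compiles on its own; the sample values (site pattern, selectors, the "string" field type, the fingerprint) and the helper parseStrategy are invented for illustration and are not part of the package.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local mirrors of the documented types, copied so this sketch is self-contained.
type FieldMapping struct {
	Name      string   `json:"name"`
	Selector  string   `json:"selector"`
	Attribute string   `json:"attribute"`
	Transform string   `json:"transform,omitempty"`
	Type      string   `json:"type"`
	Fallbacks []string `json:"fallbacks,omitempty"`
}

type PaginationRule struct {
	Type       string `json:"type"`
	Selector   string `json:"selector"`
	URLPattern string `json:"url_pattern,omitempty"`
	HasMore    string `json:"has_more,omitempty"`
}

type ExtractionStrategy struct {
	SitePattern       string          `json:"site_pattern"`
	ContainerSelector string          `json:"container_selector,omitempty"`
	ItemSelector      string          `json:"item_selector"`
	Fields            []FieldMapping  `json:"fields"`
	Pagination        *PaginationRule `json:"pagination,omitempty"`
	Confidence        float64         `json:"confidence"`
	Fingerprint       string          `json:"fingerprint"`
}

// sampleJSON is invented data; every value in it is hypothetical.
const sampleJSON = `{
  "site_pattern": "example.com/products/*",
  "item_selector": "li.product",
  "fields": [
    {"name": "title", "selector": "h2", "attribute": "text", "transform": "trim", "type": "string"},
    {"name": "link", "selector": "a", "attribute": "href", "type": "string"}
  ],
  "confidence": 0.9,
  "fingerprint": "abc123"
}`

// parseStrategy is a hypothetical helper that decodes a strategy document.
func parseStrategy(data []byte) (*ExtractionStrategy, error) {
	var s ExtractionStrategy
	if err := json.Unmarshal(data, &s); err != nil {
		return nil, err
	}
	return &s, nil
}

func main() {
	s, err := parseStrategy([]byte(sampleJSON))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	// Optional keys that were omitted stay at their zero values.
	fmt.Println("item selector:", s.ItemSelector)
	fmt.Println("fields:", len(s.Fields))
	fmt.Println("has pagination:", s.Pagination != nil)
}
```

Because container_selector and pagination carry omitempty, a document that leaves them out decodes to an empty string and a nil pointer respectively.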

func Derive

Derive calls the Anthropic API to derive an extraction strategy from HTML. If RawHTML is provided, Derive validates the strategy against the page and, if the selectors match zero elements, retries once with feedback.
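The validate-then-retry behaviour described here can be pictured with the minimal sketch below. It does not call the package or any LLM: deriveFn, validateFn, fakeDerive, fakeValidate and deriveWithRetry are all hypothetical stand-ins, and returning an error once the retry budget is spent is an assumption, since the documentation does not say what Derive returns in that case.

```go
package main

import (
	"errors"
	"fmt"
)

// deriveFn is a hypothetical stand-in for the LLM call; it receives feedback
// from the previous failed attempt and returns an item selector.
type deriveFn func(feedback string) (string, error)

// validateFn is a hypothetical stand-in for page validation; it returns the
// number of matched items and any issues.
type validateFn func(selector string) (int, []string)

// deriveWithRetry derives a selector, validates it, and retries with feedback
// when nothing matched, up to maxRetries additional attempts.
func deriveWithRetry(derive deriveFn, validate validateFn, maxRetries int) (string, int, error) {
	feedback := ""
	for attempt := 0; attempt <= maxRetries; attempt++ {
		sel, err := derive(feedback)
		if err != nil {
			return "", 0, err
		}
		n, issues := validate(sel)
		if n > 0 {
			return sel, n, nil
		}
		feedback = fmt.Sprintf("selector %q matched 0 elements: %v", sel, issues)
	}
	// Assumption: give up with an error once the retry budget is exhausted.
	return "", 0, errors.New("selectors still matched 0 elements after retrying")
}

// fakeDerive returns a bad selector first and a good one once it has feedback.
func fakeDerive(feedback string) (string, error) {
	if feedback == "" {
		return "div.missing", nil
	}
	return "li.product", nil
}

// fakeValidate pretends that only "li.product" exists on the page.
func fakeValidate(selector string) (int, []string) {
	if selector == "li.product" {
		return 3, nil
	}
	return 0, []string{"item_selector matched nothing"}
}

func main() {
	sel, n, err := deriveWithRetry(fakeDerive, fakeValidate, 1)
	if err != nil {
		fmt.Println("derive failed:", err)
		return
	}
	fmt.Printf("selector %s matched %d items\n", sel, n)
}
```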

func LoadCached

func LoadCached(urlPattern, fingerprint string) (*ExtractionStrategy, error)

LoadCached attempts to load a cached strategy matching the URL pattern and fingerprint. Returns nil if no cached strategy exists or the fingerprint doesn't match.

func LoadFromFile

func LoadFromFile(path string) (*ExtractionStrategy, error)

LoadFromFile loads a strategy from a specific file path.

type FieldMapping

type FieldMapping struct {
	Name      string   `json:"name"`
	Selector  string   `json:"selector"`
	Attribute string   `json:"attribute"`           // "text", "href", "src", or any HTML attribute
	Transform string   `json:"transform,omitempty"` // "trim", "parse_price", "parse_date"
	Type      string   `json:"type"`
	Fallbacks []string `json:"fallbacks,omitempty"`
}

FieldMapping describes how to extract a single field from an item element.
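The Transform values named in the struct comment ("trim", "parse_price", "parse_date") are not specified further, so the sketch below shows only one plausible reading of them. The function applyTransform, the choice to return normalized strings, the price clean-up rule, and the "Jan 2, 2006" date layout are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// applyTransform is a hypothetical helper that normalizes a raw extracted
// value according to a transform name. The semantics are assumed.
func applyTransform(name, raw string) (string, error) {
	switch name {
	case "":
		return raw, nil // no transform requested
	case "trim":
		return strings.TrimSpace(raw), nil
	case "parse_price":
		// Assumption: keep digits and the decimal point, drop currency
		// symbols and thousands separators.
		var b strings.Builder
		for _, r := range raw {
			if (r >= '0' && r <= '9') || r == '.' {
				b.WriteRune(r)
			}
		}
		f, err := strconv.ParseFloat(b.String(), 64)
		if err != nil {
			return "", fmt.Errorf("parse_price %q: %w", raw, err)
		}
		return strconv.FormatFloat(f, 'f', 2, 64), nil
	case "parse_date":
		// Assumption: input looks like "Mar 10, 2026"; output is ISO 8601.
		t, err := time.Parse("Jan 2, 2006", strings.TrimSpace(raw))
		if err != nil {
			return "", fmt.Errorf("parse_date %q: %w", raw, err)
		}
		return t.Format("2006-01-02"), nil
	default:
		return "", fmt.Errorf("unknown transform %q", name)
	}
}

func main() {
	inputs := []struct{ transform, raw string }{
		{"trim", "  Widget \n"},
		{"parse_price", "$1,299.50"},
		{"parse_date", "Mar 10, 2026"},
	}
	for _, in := range inputs {
		out, err := applyTransform(in.transform, in.raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s(%q) = %q\n", in.transform, in.raw, out)
	}
}
```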

type PaginationRule

type PaginationRule struct {
	Type       string `json:"type"` // "next_link", "url_increment", "load_more", "infinite_scroll"
	Selector   string `json:"selector"`
	URLPattern string `json:"url_pattern,omitempty"`
	HasMore    string `json:"has_more,omitempty"`
}

PaginationRule describes how to navigate between pages.
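For the "url_increment" type, one way to picture how URLPattern might be used is the sketch below. The "{page}" placeholder syntax and the helper nextPageURL are assumptions; the documentation does not state what format URLPattern takes. The struct is a local mirror of the documented type so the example stands alone.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// PaginationRule mirrors the documented struct so this sketch is self-contained.
type PaginationRule struct {
	Type       string `json:"type"`
	Selector   string `json:"selector"`
	URLPattern string `json:"url_pattern,omitempty"`
	HasMore    string `json:"has_more,omitempty"`
}

// pagePlaceholder is an assumed placeholder syntax, not confirmed by the package.
const pagePlaceholder = "{page}"

// nextPageURL is a hypothetical helper that builds the URL for a page number
// under a "url_increment" rule. Other rule types need a browser action
// (following a link, clicking a button, scrolling) rather than a computed URL.
func nextPageURL(rule PaginationRule, page int) (string, error) {
	if rule.Type != "url_increment" {
		return "", fmt.Errorf("pagination type %q has no URL to compute", rule.Type)
	}
	if !strings.Contains(rule.URLPattern, pagePlaceholder) {
		return "", fmt.Errorf("url_pattern %q lacks %s", rule.URLPattern, pagePlaceholder)
	}
	return strings.ReplaceAll(rule.URLPattern, pagePlaceholder, strconv.Itoa(page)), nil
}

func main() {
	rule := PaginationRule{
		Type:       "url_increment",
		URLPattern: "https://example.com/list?page={page}", // invented URL
	}
	for page := 2; page <= 4; page++ {
		u, err := nextPageURL(rule, page)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Println(u)
	}
}
```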
