Documentation
Functions ¶
func SaveCache ¶
func SaveCache(s *ExtractionStrategy) error
SaveCache writes a strategy to the cache directory.
func ValidateAgainstPage ¶
func ValidateAgainstPage(s *ExtractionStrategy, html []byte) (int, []string, error)
ValidateAgainstPage reports whether a strategy's selectors match elements in the given HTML. It returns the number of items found and a list of any issues detected.
Types ¶
type CandidateRegion ¶
type CandidateRegion struct {
	Selector       string
	ItemCount      int
	Context        string // nearby heading text
	SectionID      string // nearest ancestor element ID (e.g. "market-share")
	Sample         string // first ~2000 chars of region HTML
	ItemSelector   string // CSS selector for a single repeating item
	SingleItemHTML string // HTML of one single item (for the LLM to examine)
}
CandidateRegion is a detected repeating data region on the page.
type DeriveRequest ¶
type DeriveRequest struct {
	SimplifiedHTML   string
	URL              string
	FieldNames       []string
	FieldDescs       map[string]string // optional field descriptions
	Query            string            // natural language query (--query mode)
	Model            string
	Provider         string
	APIKey           string
	CandidateRegions []CandidateRegion // pre-detected repeating patterns (optional)
	RawHTML          []byte            // raw HTML for validation-retry loop (optional)
}
DeriveRequest holds the inputs for strategy derivation.
type ExtractionStrategy ¶
type ExtractionStrategy struct {
	SitePattern       string          `json:"site_pattern"`
	ContainerSelector string          `json:"container_selector,omitempty"` // scopes item_selector to a specific page section
	ItemSelector      string          `json:"item_selector"`
	Fields            []FieldMapping  `json:"fields"`
	Pagination        *PaginationRule `json:"pagination,omitempty"`
	Confidence        float64         `json:"confidence"`
	Fingerprint       string          `json:"fingerprint"`
}
ExtractionStrategy is the LLM-derived plan for extracting data from a page.
func Derive ¶
func Derive(ctx context.Context, req DeriveRequest) (*ExtractionStrategy, error)
Derive calls the Anthropic API to derive an extraction strategy from HTML. If RawHTML is provided, it validates the strategy against the page and, if the selectors match zero elements, retries once with feedback.
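The validate-then-retry behavior described above can be sketched as a small loop. This is not the package's implementation: strategy, deriveFunc, validateFunc, and deriveWithRetry are hypothetical stand-ins, the feedback message format is invented, and returning the last strategy when the retry still matches nothing is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// strategy is a hypothetical, trimmed-down stand-in for ExtractionStrategy.
type strategy struct{ ItemSelector string }

// deriveFunc stands in for the LLM call; feedback is empty on the first attempt.
type deriveFunc func(feedback string) (*strategy, error)

// validateFunc mirrors the shape of ValidateAgainstPage's results.
type validateFunc func(s *strategy) (count int, issues []string, err error)

// deriveWithRetry derives a strategy and, when validation finds zero items,
// derives again with feedback, at most maxRetries times.
func deriveWithRetry(derive deriveFunc, validate validateFunc, maxRetries int) (*strategy, error) {
	feedback := ""
	for attempt := 0; ; attempt++ {
		s, err := derive(feedback)
		if err != nil {
			return nil, err
		}
		if validate == nil { // no raw HTML available: skip validation
			return s, nil
		}
		count, issues, err := validate(s)
		if err != nil {
			return nil, err
		}
		if count > 0 || attempt >= maxRetries {
			return s, nil // assumption: the last attempt is returned as-is
		}
		feedback = fmt.Sprintf("item selector %q matched 0 elements: %s",
			s.ItemSelector, strings.Join(issues, "; "))
	}
}

func main() {
	calls := 0
	derive := func(feedback string) (*strategy, error) {
		calls++
		if feedback == "" {
			return &strategy{ItemSelector: ".wrong"}, nil
		}
		return &strategy{ItemSelector: ".item"}, nil
	}
	validate := func(s *strategy) (int, []string, error) {
		if s.ItemSelector == ".item" {
			return 3, nil, nil
		}
		return 0, []string{"no elements matched"}, nil
	}
	s, err := deriveWithRetry(derive, validate, 1)
	fmt.Println(s.ItemSelector, calls, err) // prints ".item 2 <nil>"
}
```

Passing the limit as a parameter keeps the loop bounded; with maxRetries set to 1 it matches the single retry the documentation describes.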
func LoadCached ¶
func LoadCached(urlPattern, fingerprint string) (*ExtractionStrategy, error)
LoadCached attempts to load a cached strategy matching the URL pattern and fingerprint. It returns nil if no cached strategy exists or the fingerprint doesn't match.
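The lookup contract (a hit only when both the URL pattern and the fingerprint match) can be illustrated with an in-memory stand-in. The real cache lives in a cache directory whose layout is not described here, so cachedStrategy, memCache, and load are hypothetical names for this sketch only.

```go
package main

import "fmt"

// cachedStrategy is a hypothetical, trimmed-down stand-in for ExtractionStrategy.
type cachedStrategy struct {
	SitePattern  string
	Fingerprint  string
	ItemSelector string
}

// memCache is an in-memory stand-in for the on-disk cache, keyed by URL pattern.
type memCache map[string]*cachedStrategy

// load returns the cached entry only if the fingerprint still matches;
// otherwise it returns nil, mirroring the documented behavior of LoadCached.
func (c memCache) load(urlPattern, fingerprint string) *cachedStrategy {
	s, ok := c[urlPattern]
	if !ok || s.Fingerprint != fingerprint {
		return nil
	}
	return s
}

func main() {
	cache := memCache{
		"example.com/products/*": &cachedStrategy{
			SitePattern:  "example.com/products/*",
			Fingerprint:  "abc123",
			ItemSelector: "li.product",
		},
	}

	if s := cache.load("example.com/products/*", "abc123"); s != nil {
		fmt.Println("hit:", s.ItemSelector)
	}
	if s := cache.load("example.com/products/*", "changed"); s == nil {
		fmt.Println("miss: page structure changed, derive again")
	}
}
```

A caller would typically treat a nil result as a signal to fall back to deriving a fresh strategy and saving it.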
func LoadFromFile ¶
func LoadFromFile(path string) (*ExtractionStrategy, error)
LoadFromFile loads a strategy from a specific file path.
type FieldMapping ¶
type FieldMapping struct {
	Name      string   `json:"name"`
	Selector  string   `json:"selector"`
	Attribute string   `json:"attribute"`           // "text", "href", "src", or any HTML attribute
	Transform string   `json:"transform,omitempty"` // "trim", "parse_price", "parse_date"
	Type      string   `json:"type"`
	Fallbacks []string `json:"fallbacks,omitempty"`
}
FieldMapping describes how to extract a single field from an item element.
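The listing does not spell out how Fallbacks are applied. The sketch below assumes they are alternative selectors tried in order after Selector fails, and that the "trim" transform strips surrounding whitespace; lookupFunc and extractField are hypothetical, and the DOM query is replaced by a plain function so the example stays self-contained.

```go
package main

import (
	"fmt"
	"strings"
)

// FieldMapping is copied from the listing in this document.
type FieldMapping struct {
	Name      string   `json:"name"`
	Selector  string   `json:"selector"`
	Attribute string   `json:"attribute"`
	Transform string   `json:"transform,omitempty"`
	Type      string   `json:"type"`
	Fallbacks []string `json:"fallbacks,omitempty"`
}

// lookupFunc is a hypothetical stand-in for a DOM query inside one item:
// it returns the attribute (or text) of the first match and whether one exists.
type lookupFunc func(selector, attribute string) (string, bool)

// extractField tries the primary selector, then each fallback in order
// (assumed semantics), and applies the "trim" transform if requested.
func extractField(f FieldMapping, lookup lookupFunc) (string, bool) {
	selectors := append([]string{f.Selector}, f.Fallbacks...)
	for _, sel := range selectors {
		v, ok := lookup(sel, f.Attribute)
		if !ok {
			continue
		}
		if f.Transform == "trim" {
			v = strings.TrimSpace(v)
		}
		return v, true
	}
	return "", false
}

func main() {
	// Fake item: only the fallback selector has a match.
	item := map[string]string{"span.name|text": "  Widget  "}
	lookup := func(selector, attribute string) (string, bool) {
		v, ok := item[selector+"|"+attribute]
		return v, ok
	}

	f := FieldMapping{
		Name:      "title",
		Selector:  "h2.title",
		Attribute: "text",
		Transform: "trim",
		Type:      "string",
		Fallbacks: []string{"span.name"},
	}
	v, ok := extractField(f, lookup)
	fmt.Printf("%q %v\n", v, ok)
}
```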
type PaginationRule ¶
type PaginationRule struct {
	Type       string `json:"type"` // "next_link", "url_increment", "load_more", "infinite_scroll"
	Selector   string `json:"selector"`
	URLPattern string `json:"url_pattern,omitempty"`
	HasMore    string `json:"has_more,omitempty"`
}
PaginationRule describes how to navigate between pages.
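As an illustration of the "url_increment" type, the sketch below builds page URLs from URLPattern. The format of URLPattern is not documented in this listing, so the "{page}" placeholder is purely an assumption, and nextPageURL is a hypothetical helper rather than part of the package.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// PaginationRule is copied from the listing in this document.
type PaginationRule struct {
	Type       string `json:"type"`
	Selector   string `json:"selector"`
	URLPattern string `json:"url_pattern,omitempty"`
	HasMore    string `json:"has_more,omitempty"`
}

// nextPageURL substitutes the page number into the pattern.
// Assumption: the pattern marks the page number with "{page}".
func nextPageURL(r PaginationRule, page int) (string, bool) {
	if r.Type != "url_increment" || !strings.Contains(r.URLPattern, "{page}") {
		return "", false
	}
	return strings.ReplaceAll(r.URLPattern, "{page}", strconv.Itoa(page)), true
}

func main() {
	rule := PaginationRule{
		Type:       "url_increment",
		URLPattern: "https://example.com/products?page={page}", // hypothetical
	}
	for page := 2; page <= 3; page++ {
		if u, ok := nextPageURL(rule, page); ok {
			fmt.Println(u)
		}
	}
}
```

The other types ("next_link", "load_more", "infinite_scroll") would rely on Selector rather than a URL template, and handling them needs a DOM or browser layer that is outside this sketch.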