Documentation ¶
Index ¶
- Constants
- func RefineByUrlPattern(allEntries, validEntries []*model.Recipe) (string, int)
- func ReplayDiscovered(data *model.DataInput, feed *model.Feed, d *model.DiscoveredFeed) error
- func ScrapeFeed(data *model.DataInput, feed *model.Feed, sampling SamplingOptions) error
- func UrlPathPattern(urls []string) string
- type GroupValidator
- type SamplingOptions
Constants ¶
const (
	SourceRSSLink      model.DiscoverySource = "rss-link"
	SourceSitemap      model.DiscoverySource = "sitemap"
	SourceDOMContainer model.DiscoverySource = "dom-container"
)
Source constants for DiscoveredFeed.Source.
const (
// AcceptThreshold is the confidence score above which results are accepted without sampling.
AcceptThreshold = 0.70
)
Variables ¶
This section is empty.
Functions ¶
func RefineByUrlPattern ¶
RefineByUrlPattern derives the common URL path prefix from validEntries and returns that pattern together with the number of entries in allEntries whose URL matches it.
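The refinement described above can be sketched as follows. This is only an illustration under stated assumptions, not the package's implementation: `recipe` is a hypothetical stand-in for model.Recipe (only a `Url` field is assumed), `sharedPrefix` and `refineByPrefix` are made-up names, and URLs are compared as plain path strings.

```go
package main

import (
	"fmt"
	"strings"
)

// recipe is a hypothetical stand-in for model.Recipe; only Url is assumed.
type recipe struct{ Url string }

// sharedPrefix returns the longest common string prefix of urls,
// cut back to the last "/" so it always ends on a path boundary.
func sharedPrefix(urls []string) string {
	if len(urls) == 0 {
		return ""
	}
	prefix := urls[0]
	for _, u := range urls[1:] {
		// Shorten the prefix until u starts with it (the empty prefix always matches).
		for !strings.HasPrefix(u, prefix) {
			prefix = prefix[:len(prefix)-1]
		}
	}
	if i := strings.LastIndex(prefix, "/"); i >= 0 {
		return prefix[:i+1]
	}
	return ""
}

// refineByPrefix derives the prefix from valid and counts the entries in all that match it.
func refineByPrefix(all, valid []*recipe) (string, int) {
	urls := make([]string, 0, len(valid))
	for _, r := range valid {
		urls = append(urls, r.Url)
	}
	pattern := sharedPrefix(urls)
	if pattern == "" {
		return "", 0
	}
	count := 0
	for _, r := range all {
		if strings.HasPrefix(r.Url, pattern) {
			count++
		}
	}
	return pattern, count
}

func main() {
	valid := []*recipe{{Url: "/recipes/pasta"}, {Url: "/recipes/chicken"}}
	all := append([]*recipe{{Url: "/recipes/soup"}, {Url: "/about"}}, valid...)
	pattern, n := refineByPrefix(all, valid)
	fmt.Println(pattern, n) // prints "/recipes/ 3"
}
```

In this sketch an unvalidated entry such as /recipes/soup is still counted, because it shares the prefix learned from the validated entries.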
func ReplayDiscovered ¶
ReplayDiscovered replays a previously discovered feed configuration.
func ScrapeFeed ¶
ScrapeFeed runs the discovery pipeline against the given DataInput. On success, feed.Entries contains stub recipes (Url only) and feed.Discovered describes how they were found. When sampling.Validator is set, it is called with a sample of URLs from each DOM candidate group before that group is committed. A group in which fewer than half of the sampled URLs are recipes is skipped, and the next-best group is tried instead.
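The "fewer than half" rule for DOM candidate groups can be sketched roughly as below. Everything here is an assumption made for illustration: the validator signature, the local `recipe` and `samplingOptions` types, the `pickGroup` and `recipePathValidator` helpers, and the choice to sample the first SampleSize URLs of each group. The real pipeline may rank and sample groups differently.

```go
package main

import (
	"fmt"
	"strings"
)

// recipe is a hypothetical stand-in for model.Recipe.
type recipe struct{ Url string }

// groupValidator mirrors the described GroupValidator; its signature is assumed.
type groupValidator func(sample []string) []*recipe

// samplingOptions mirrors the documented SamplingOptions fields.
type samplingOptions struct {
	Validator  groupValidator
	SampleSize int
}

// pickGroup walks groups in ranked order (best first) and returns the index
// of the first group that passes validation, or -1 if none does.
func pickGroup(groups [][]string, opts samplingOptions) int {
	for i, urls := range groups {
		if len(urls) == 0 {
			continue
		}
		if opts.Validator == nil || opts.SampleSize <= 0 {
			return i // sampling disabled: accept the best non-empty group
		}
		n := opts.SampleSize
		if n > len(urls) {
			n = len(urls)
		}
		sample := urls[:n] // assumption: take the first n URLs as the sample
		confirmed := opts.Validator(sample)
		if 2*len(confirmed) < len(sample) {
			continue // fewer than half confirmed: try the next-best group
		}
		return i
	}
	return -1
}

// recipePathValidator is a toy validator that "confirms" URLs under /recipes/.
func recipePathValidator(sample []string) []*recipe {
	var confirmed []*recipe
	for _, u := range sample {
		if strings.HasPrefix(u, "/recipes/") {
			confirmed = append(confirmed, &recipe{Url: u})
		}
	}
	return confirmed
}

func main() {
	groups := [][]string{
		{"/tag/a", "/tag/b", "/tag/c", "/tag/d"},
		{"/recipes/pasta", "/recipes/chicken"},
	}
	opts := samplingOptions{Validator: recipePathValidator, SampleSize: 4}
	fmt.Println(pickGroup(groups, opts)) // prints 1
}
```

A group with exactly half of its sample confirmed is accepted in this sketch, since only groups with fewer than half are skipped.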
func UrlPathPattern ¶
UrlPathPattern returns the common path prefix shared by the given URLs. Example: ["/recipes/pasta", "/recipes/chicken"] → "/recipes/"
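One way to compute such a prefix is to compare paths segment by segment. The following is a minimal sketch, not the package's actual code: `commonPathPrefix` is a hypothetical name, and its handling of absolute URLs, empty input, and paths with nothing in common is an assumption.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// commonPathPrefix returns the directory prefix shared by every URL path.
// It is a hypothetical illustration, not the package's UrlPathPattern.
func commonPathPrefix(urls []string) string {
	if len(urls) == 0 {
		return ""
	}
	var common []string
	for i, raw := range urls {
		path := raw
		if u, err := url.Parse(raw); err == nil {
			path = u.Path // ignore scheme, host, and query when present
		}
		// Split into segments and drop the last one (the page slug).
		segs := strings.Split(strings.Trim(path, "/"), "/")
		dirs := segs[:len(segs)-1]
		if i == 0 {
			common = dirs
			continue
		}
		// Keep only the leading segments that still match.
		n := 0
		for n < len(common) && n < len(dirs) && common[n] == dirs[n] {
			n++
		}
		common = common[:n]
	}
	if len(common) == 0 {
		return "/"
	}
	return "/" + strings.Join(common, "/") + "/"
}

func main() {
	fmt.Println(commonPathPrefix([]string{"/recipes/pasta", "/recipes/chicken"})) // prints "/recipes/"
}
```

Comparing whole segments avoids treating /recipes-old/x and /recipes/y as sharing the prefix /recipes.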
Types ¶
type GroupValidator ¶ added in v0.18.0
GroupValidator samples a slice of candidate URLs and returns the subset that are confirmed recipe pages (already scraped, ready to merge into feed entries).
type SamplingOptions ¶ added in v0.18.0
type SamplingOptions struct {
// Validator is called with a sample of candidate URLs; nil disables validation.
Validator GroupValidator
// SampleSize is the number of URLs to sample per candidate group; 0 disables sampling.
SampleSize int
}
SamplingOptions configures optional URL validation during DOM discovery. The zero value disables sampling entirely.
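How the zero value and AcceptThreshold might interact can be sketched as below. This is an assumption-laden illustration: the types are local mirrors of the documented ones, the validator signature is guessed, `shouldSample` is a hypothetical helper, and "above" is read as a strict comparison.

```go
package main

import "fmt"

// acceptThreshold mirrors the documented AcceptThreshold constant.
const acceptThreshold = 0.70

// recipe is a hypothetical stand-in for model.Recipe.
type recipe struct{ Url string }

// groupValidator mirrors the described GroupValidator; its signature is assumed.
type groupValidator func(sample []string) []*recipe

// samplingOptions mirrors the documented SamplingOptions fields.
type samplingOptions struct {
	Validator  groupValidator
	SampleSize int
}

// shouldSample reports whether a result with the given confidence would be
// validated by sampling under opts.
func shouldSample(confidence float64, opts samplingOptions) bool {
	if opts.Validator == nil || opts.SampleSize <= 0 {
		return false // zero value (or either field unset) disables sampling
	}
	// Results scoring above the threshold are accepted without sampling.
	return confidence <= acceptThreshold
}

func main() {
	opts := samplingOptions{
		Validator:  func(sample []string) []*recipe { return nil },
		SampleSize: 5,
	}
	fmt.Println(shouldSample(0.90, opts))              // false: above the threshold
	fmt.Println(shouldSample(0.50, opts))              // true: low confidence, sampling enabled
	fmt.Println(shouldSample(0.50, samplingOptions{})) // false: zero value disables sampling
}
```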