Documentation ¶
Overview ¶
Package scraper drives headless Chrome (via chromedp) to pull job listings from sites that don't expose an official API. The host sites (LinkedIn, Indeed, Glassdoor) prohibit automated access in their Terms of Service. This package exists because the user accepted those risks explicitly; it should not be used at any scale that could harm those sites or get the user's IP / account flagged.
Selectors will drift. When something breaks, dump the page HTML and update the per-site parser; a `--debug-html` flag for this is planned but not yet implemented (TODO). Treat this package as load-bearing but fragile.
Index ¶
Constants ¶
const PoliteDelay = 1500 * time.Millisecond
PoliteDelay is how long callers should wait between consecutive scrapes against the same host. The value is a compromise: too low and Cloudflare rate-limits us; too high and `import --all` becomes painfully slow.
Variables ¶
This section is empty.
Functions ¶
Types ¶
type Browser ¶
type Browser struct {
	// contains filtered or unexported fields
}
Browser owns one chromedp allocator + tab pair. Reuse it across many FetchPage calls in a single import run; close it when done.
func NewBrowser ¶
NewBrowser launches headless Chrome with a desktop-shaped viewport and a real-looking User-Agent. The Chrome process is actually started by the first Run() call; if Chrome isn't installed, NewBrowser returns a clear error.
type Result ¶
Result is what one scrape produces. Discovered is the raw count of job cards seen in the HTML before any filtering. The caller can compare it to len(Jobs) to detect parsing problems: for example, 25 cards visible but only 3 with usable URLs suggests the selectors are stale.
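The comparison described above could be sketched as follows. The `Result` type here is a stand-in that carries only the two fields the text names (with Jobs simplified to a string slice), and `looksStale` with its `minRatio` threshold is a hypothetical helper, not something this package is known to provide.

```go
package main

import "fmt"

// Result is a stand-in with only the two fields mentioned in the docs.
// The real struct may hold more, and Jobs is simplified to []string here.
type Result struct {
	Jobs       []string
	Discovered int
}

// looksStale is a hypothetical helper. It reports whether fewer than
// minRatio of the discovered cards were turned into usable jobs.
func looksStale(r Result, minRatio float64) bool {
	if r.Discovered == 0 {
		// Nothing was seen, so the ratio is undefined; this check cannot
		// distinguish an empty results page from a broken selector.
		return false
	}
	return float64(len(r.Jobs))/float64(r.Discovered) < minRatio
}

func main() {
	r := Result{Jobs: []string{"a", "b", "c"}, Discovered: 25}
	if looksStale(r, 0.5) {
		fmt.Printf("only %d of %d cards parsed; selectors may be stale\n",
			len(r.Jobs), r.Discovered)
	}
}
```

The 0.5 threshold is arbitrary; a caller would likely tune it per site.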
type SiteParser ¶
type SiteParser interface {
	Name() string
	// WaitSelector is the CSS selector the headless browser will wait
	// for before reading the DOM. Empty means "no wait: just sleep
	// after navigate, then read."
	WaitSelector() string
	Parse(htmlText string, src source.Source) (jobs []job.Job, discovered int, err error)
}
SiteParser converts a search-page HTML string into normalized jobs. Each per-site file in this package implements one.
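To illustrate the shape of a per-site implementation, here is a rough sketch for an imaginary site. Everything in it is an assumption: `Job` and `Source` are stand-ins for job.Job and source.Source (whose fields are not shown here), `exampleParser` and the `job-card` markup are invented, and the string and regexp matching is used only to keep the snippet dependency-free. A real parser would more likely use a proper HTML parser.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Job and Source are stand-ins for job.Job and source.Source.
type Job struct{ URL string }
type Source struct{ Name string }

// SiteParser mirrors the documented interface with the stand-in types.
type SiteParser interface {
	Name() string
	WaitSelector() string
	Parse(htmlText string, src Source) (jobs []Job, discovered int, err error)
}

// exampleParser is a hypothetical parser for an imaginary site whose job
// cards are <div class="job-card"> elements containing at most one link.
type exampleParser struct{}

// hrefRe captures the href value of the first anchor tag in a fragment.
var hrefRe = regexp.MustCompile(`<a[^>]*\shref="([^"]+)"`)

func (exampleParser) Name() string         { return "example" }
func (exampleParser) WaitSelector() string { return "div.job-card" }

func (exampleParser) Parse(htmlText string, src Source) ([]Job, int, error) {
	// Split on the card marker; the text before the first marker is not a card.
	cards := strings.Split(htmlText, `<div class="job-card"`)[1:]
	var jobs []Job
	for _, card := range cards {
		if m := hrefRe.FindStringSubmatch(card); m != nil {
			jobs = append(jobs, Job{URL: m[1]})
		}
	}
	// discovered counts every card seen, including those without a usable URL.
	return jobs, len(cards), nil
}

// Compile-time check that exampleParser satisfies the interface.
var _ SiteParser = exampleParser{}

func main() {
	page := `<div class="job-card"><a href="/jobs/1">A</a></div>` +
		`<div class="job-card"><span>no link</span></div>`
	jobs, discovered, _ := exampleParser{}.Parse(page, Source{Name: "example"})
	fmt.Println(len(jobs), discovered) // prints: 1 2
}
```

Returning the card count separately from the job slice is what lets a caller notice when cards are visible but cannot be parsed.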