Documentation
Overview ¶
Package scraper drives headless Chrome (via chromedp) to pull job listings from sites that don't expose an official API. The host sites (LinkedIn, Indeed, Glassdoor, Handshake) may prohibit automated access in their Terms of Service. This package exists because the user accepted those risks explicitly; it should not be used at any scale that could harm those sites or get the user's IP / account flagged.
Selectors will drift. When something breaks, run the scrape with `--debug-html` (TODO) to dump the page HTML and update the per-site parser. Treat this package as load-bearing-but-fragile.
Index ¶
- Constants
- func FindChromeExecutable() (string, error)
- func FindPython() (string, error)
- func HostOf(url string) string
- func Install(ctx context.Context, out io.Writer) (string, error)
- func SleepCtx(ctx context.Context, d time.Duration)
- type Browser
- type Config
- type Engine
- type Result
- type SiteParser
- type Status
Constants ¶
const PoliteDelay = 1500 * time.Millisecond
PoliteDelay is how long callers should wait between consecutive scrapes against the same host. The value is a compromise: too low and Cloudflare rate-limits us; too high and `import --all` becomes painfully slow.
Variables ¶
This section is empty.
Functions ¶
func FindChromeExecutable ¶ added in v0.7.0
FindChromeExecutable returns the path to a Chromium-based browser for scraping.
func FindPython ¶ added in v0.7.0
FindPython returns a Python executable suitable for pip (python3, python, or py -3).
func HostOf ¶
HostOf extracts the bare hostname from a URL for per-host throttling and logging. Callers should keep a map[string]time.Time, keyed by host, that records the last scrape time so PoliteDelay can be enforced across multiple scrapes.
Types ¶
type Browser ¶
type Browser struct {
	// contains filtered or unexported fields
}
Browser owns one chromedp allocator + tab pair. Reuse it across many FetchPage calls in a single import run; close it when done.
func NewBrowser ¶
NewBrowser launches a browser for scraping using cfg. When cfg is the zero value, ResolveConfig is applied with empty file settings, so only environment overrides and automatic engine selection take effect.
type Config ¶ added in v0.7.0
type Config struct {
	Engine          Engine
	CDPURL          string
	BinaryPath      string
	Proxy           string
	FingerprintSeed int // 0 = random per launch (Cloak default)
}
Config controls how scraper-backed sources launch a browser.
func ResolveConfig ¶ added in v0.7.0
ResolveConfig merges optional file settings with environment overrides. Env wins when set: JOBFORGE_SCRAPER_ENGINE, JOBFORGE_CDP_URL, CLOAKBROWSER_BINARY_PATH, JOBFORGE_SCRAPER_PROXY, JOBFORGE_SCRAPER_FINGERPRINT.
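The "env wins when set" rule can be illustrated with a small sketch. The `settings` type and `applyEnv` function below are hypothetical and cover only two of the listed variables; ResolveConfig's real signature is not shown here, and treating an empty variable as "not set" is an assumption.

```go
package main

import (
	"fmt"
	"os"
)

// settings is a hypothetical, reduced mirror of Config.
type settings struct {
	CDPURL string
	Proxy  string
}

// applyEnv returns file with any field replaced by its environment
// override. getenv is injected so the merge can be tested without
// touching the process environment.
func applyEnv(file settings, getenv func(string) string) settings {
	out := file
	if v := getenv("JOBFORGE_CDP_URL"); v != "" {
		out.CDPURL = v
	}
	if v := getenv("JOBFORGE_SCRAPER_PROXY"); v != "" {
		out.Proxy = v
	}
	return out
}

func main() {
	file := settings{CDPURL: "ws://127.0.0.1:9222", Proxy: "http://file-proxy:8080"}
	fmt.Printf("%+v\n", applyEnv(file, os.Getenv))
}
```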
type Engine ¶ added in v0.7.0
type Engine string
Engine selects which browser backs scraper imports.
type Result ¶
Result is what one scrape produces. Discovered is the raw count seen in the HTML before any filtering. The caller can compare it to len(Jobs) to detect parsing problems: for example, 25 cards visible but only 3 with usable URLs suggests the selectors are stale.
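One way a caller might act on the Discovered versus len(Jobs) comparison is a simple ratio check. The `looksStale` helper and its 0.5 threshold are hypothetical, chosen only for illustration; the package does not document such a function.

```go
package main

import "fmt"

// looksStale reports whether too few of the discovered cards were parsed
// into jobs. With discovered == 0 it returns false, since the ratio is
// undefined (assumption: an empty page is handled separately).
func looksStale(discovered, parsed int, minRatio float64) bool {
	if discovered == 0 {
		return false
	}
	return float64(parsed)/float64(discovered) < minRatio
}

func main() {
	fmt.Println(looksStale(25, 3, 0.5))  // true: 3 of 25 parsed
	fmt.Println(looksStale(25, 24, 0.5)) // false: 24 of 25 parsed
}
```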
type SiteParser ¶
type SiteParser interface {
	Name() string
	// WaitSelector is the CSS selector the headless browser will wait
	// for before reading the DOM. Empty means "no wait; just sleep
	// after navigate, then read."
	WaitSelector() string
	Parse(htmlText string, src source.Source) (jobs []job.Job, discovered int, err error)
}
SiteParser converts a search-page HTML string into normalized jobs. Each per-site file in this package implements one.
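A toy parser shows the shape of the contract, in particular how discovered can exceed the number of returned jobs. Everything here is illustrative: `Job` and `Source` are simplified stand-ins for job.Job and source.Source (whose fields are not shown in this documentation), the `job-card` markup is invented, and a real parser would more likely use an HTML parser than a regular expression.

```go
package main

import (
	"fmt"
	"regexp"
)

// Job and Source are simplified stand-ins for job.Job and source.Source.
type Job struct{ URL string }
type Source struct{ Name string }

// cardRe matches an invented card markup: <a class="job-card" href="...">.
var cardRe = regexp.MustCompile(`<a class="job-card" href="([^"]*)"`)

// exampleParser is a hypothetical parser for an imaginary site.
type exampleParser struct{}

func (exampleParser) Name() string         { return "example" }
func (exampleParser) WaitSelector() string { return "a.job-card" }

// Parse counts every card as discovered and keeps only cards whose href
// is non-empty.
func (exampleParser) Parse(htmlText string, src Source) ([]Job, int, error) {
	matches := cardRe.FindAllStringSubmatch(htmlText, -1)
	var jobs []Job
	for _, m := range matches {
		if m[1] != "" {
			jobs = append(jobs, Job{URL: m[1]})
		}
	}
	return jobs, len(matches), nil
}

func main() {
	page := `<a class="job-card" href="/jobs/1">A</a><a class="job-card" href="">B</a>`
	jobs, discovered, err := exampleParser{}.Parse(page, Source{Name: "example"})
	fmt.Println(len(jobs), discovered, err) // 1 2 <nil>
}
```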
type Status ¶ added in v0.7.0
type Status struct {
	ConfigEngine   Engine
	ResolvedEngine Engine
	CloakInstalled bool
	CloakPath      string
	PythonPath     string
	ChromeOnPath   bool
	Ready          bool
	Hint           string
}
Status summarizes scraper readiness for status/doctor output.
func CheckStatus ¶ added in v0.7.0
CheckStatus inspects config and the local machine without launching a browser.