scraper

package
v0.7.1 Latest
Published: May 19, 2026 | License: MIT | Imports: 17 | Imported by: 0

Documentation

Overview

Package scraper drives headless Chrome (via chromedp) to pull job listings from sites that don't expose an official API. The host sites (LinkedIn, Indeed, Glassdoor, Handshake) may prohibit automated access in their Terms of Service. This package exists because the user accepted those risks explicitly; it should not be used at any scale that could harm those sites or get the user's IP / account flagged.

Selectors will drift. When something breaks, run the scrape with `--debug-html` (TODO: not yet implemented) to dump the page HTML, then update the per-site parser. Treat this package as load-bearing but fragile.

Constants

const PoliteDelay = 1500 * time.Millisecond

PoliteDelay is how long callers should wait between consecutive scrapes against the same host. At 1.5s the value is a compromise: too low and Cloudflare rate-limits us; too high and `import --all` becomes painfully slow.

Variables

This section is empty.

Functions

func FindChromeExecutable added in v0.7.0

func FindChromeExecutable() (string, error)

FindChromeExecutable returns the path to a Chromium-based browser for scraping.

func FindPython added in v0.7.0

func FindPython() (string, error)

FindPython returns a Python executable suitable for pip (python3, python, or py -3).

func HostOf

func HostOf(url string) string

HostOf extracts the bare hostname from a URL for per-host throttling and logging. Callers should keep a map[string]time.Time, keyed by host, that records the last scrape time so PoliteDelay can be enforced across multiple scrapes.
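
The per-host map described above might look like the following sketch. Because this page does not show the package's import path, `hostOf` and `politeDelay` are local stand-ins for HostOf and PoliteDelay, and the `throttle` type and the example.com URLs are hypothetical. The real HostOf may treat ports or malformed URLs differently.

```go
package main

import (
	"fmt"
	"net/url"
	"time"
)

// politeDelay mirrors the documented PoliteDelay value.
const politeDelay = 1500 * time.Millisecond

// hostOf is a stand-in for HostOf: it returns the hostname without a port,
// or "" when the URL cannot be parsed.
func hostOf(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return ""
	}
	return u.Hostname()
}

// throttle (hypothetical) remembers when each host was last scraped.
type throttle struct {
	delay time.Duration
	last  map[string]time.Time
}

func newThrottle(delay time.Duration) *throttle {
	return &throttle{delay: delay, last: make(map[string]time.Time)}
}

// remaining reports how long the caller must still wait before
// scraping host at time now. Hosts never seen before need no wait.
func (t *throttle) remaining(host string, now time.Time) time.Duration {
	prev, ok := t.last[host]
	if !ok {
		return 0
	}
	if wait := t.delay - now.Sub(prev); wait > 0 {
		return wait
	}
	return 0
}

// mark records that host was scraped at time at.
func (t *throttle) mark(host string, at time.Time) {
	t.last[host] = at
}

func main() {
	th := newThrottle(politeDelay)
	urls := []string{
		"https://jobs.example.com/search?q=go",
		"https://boards.example.org/search?q=go",
		"https://jobs.example.com/search?q=rust", // same host as the first URL
	}
	for _, u := range urls {
		host := hostOf(u)
		wait := th.remaining(host, time.Now())
		time.Sleep(wait) // real callers would use SleepCtx so Ctrl+C still works
		th.mark(host, time.Now())
		fmt.Printf("scraping %s after waiting %s\n", host, wait.Round(time.Millisecond))
	}
}
```

Only the third URL should incur a wait, since it repeats a host that was scraped moments earlier.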

func Install added in v0.7.0

func Install(ctx context.Context, out io.Writer) (string, error)

Install downloads CloakBrowser via pip and runs `python -m cloakbrowser install`.

func SleepCtx

func SleepCtx(ctx context.Context, d time.Duration)

SleepCtx pauses for d unless ctx cancels first. Used between scrapes to enforce PoliteDelay while still respecting Ctrl+C.
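
A cancellable pause of this kind is commonly written with a timer and a `select`. The sketch below is an assumption about the general technique, not the package's actual implementation; `sleepCtx` and `cancelledSleep` are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// sleepCtx waits for d, returning early if ctx is cancelled first.
func sleepCtx(ctx context.Context, d time.Duration) {
	if d <= 0 {
		return
	}
	timer := time.NewTimer(d)
	defer timer.Stop() // release the timer if ctx wins the race
	select {
	case <-ctx.Done():
	case <-timer.C:
	}
}

// cancelledSleep asks for a 10s sleep under a context that expires
// after 50ms and reports how long the call really took.
func cancelledSleep() time.Duration {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	start := time.Now()
	sleepCtx(ctx, 10*time.Second)
	return time.Since(start)
}

func main() {
	fmt.Println("returned after", cancelledSleep().Round(10*time.Millisecond))
}
```

In a command-line tool the context would typically come from `signal.NotifyContext(context.Background(), os.Interrupt)`, so that Ctrl+C cancels the pause.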

Types

type Browser

type Browser struct {
	// contains filtered or unexported fields
}

Browser owns one chromedp allocator + tab pair. Reuse it across many FetchPage calls in a single import run; close it when done.

func NewBrowser

func NewBrowser(parent context.Context, cfg Config) (*Browser, error)

NewBrowser launches a browser for scraping using cfg. When cfg is the zero value, NewBrowser applies ResolveConfig to an empty Config, so settings come from environment variables and the engine defaults to auto.

func (*Browser) Close

func (b *Browser) Close()

Close releases the underlying Chrome process and allocator. Safe to call multiple times.
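
One common way to make a Close method safe to call repeatedly is `sync.Once`. The sketch below illustrates that pattern with a hypothetical `browser` type; this page does not say how Browser actually achieves it.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// browser is a hypothetical stand-in that owns one cancellable resource.
type browser struct {
	cancel    context.CancelFunc
	closeOnce sync.Once
	releases  int // counts how often the resource was actually released
}

func newBrowser(parent context.Context) *browser {
	_, cancel := context.WithCancel(parent)
	return &browser{cancel: cancel}
}

// Close releases the resource on the first call and does nothing afterwards.
func (b *browser) Close() {
	b.closeOnce.Do(func() {
		b.cancel()
		b.releases++
	})
}

func main() {
	b := newBrowser(context.Background())
	defer b.Close() // harmless even though Close is also called below
	b.Close()
	b.Close()
	fmt.Println("releases:", b.releases)
}
```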

func (*Browser) Engine added in v0.7.0

func (b *Browser) Engine() Engine

Engine reports which backend this browser is using.

type Config added in v0.7.0

type Config struct {
	Engine          Engine
	CDPURL          string
	BinaryPath      string
	Proxy           string
	FingerprintSeed int // 0 = random per launch (Cloak default)
}

Config controls how scraper-backed sources launch a browser.

func FromApp added in v0.7.0

func FromApp(engine, cdpURL, binaryPath, proxy string, fingerprintSeed int) Config

FromApp maps config.json scraper settings into a scraper Config.

func ResolveConfig added in v0.7.0

func ResolveConfig(file Config) Config

ResolveConfig merges optional file settings with environment overrides. An environment variable, when set, takes precedence over the corresponding file setting: JOBFORGE_SCRAPER_ENGINE, JOBFORGE_CDP_URL, CLOAKBROWSER_BINARY_PATH, JOBFORGE_SCRAPER_PROXY, JOBFORGE_SCRAPER_FINGERPRINT.
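
The "env wins when set" rule can be sketched as an overlay over the file settings. The `config` type and `overlay` function below are hypothetical stand-ins. The mapping of each variable to a field is inferred from the names, and both the handling of a non-numeric fingerprint (ignored here) and the fallback to "auto" are assumptions.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// config is a local stand-in mirroring the documented Config fields.
type config struct {
	Engine          string
	CDPURL          string
	BinaryPath      string
	Proxy           string
	FingerprintSeed int
}

// overlay returns file with every non-empty environment value applied on top.
// getenv is injected so the logic can be exercised without touching the
// real environment.
func overlay(file config, getenv func(string) string) config {
	out := file
	if v := getenv("JOBFORGE_SCRAPER_ENGINE"); v != "" {
		out.Engine = v
	}
	if v := getenv("JOBFORGE_CDP_URL"); v != "" {
		out.CDPURL = v
	}
	if v := getenv("CLOAKBROWSER_BINARY_PATH"); v != "" {
		out.BinaryPath = v
	}
	if v := getenv("JOBFORGE_SCRAPER_PROXY"); v != "" {
		out.Proxy = v
	}
	if v := getenv("JOBFORGE_SCRAPER_FINGERPRINT"); v != "" {
		// Assumption: a value that is not an integer leaves the file setting alone.
		if n, err := strconv.Atoi(v); err == nil {
			out.FingerprintSeed = n
		}
	}
	if out.Engine == "" {
		out.Engine = "auto" // assumption: nothing configured means auto
	}
	return out
}

func main() {
	fileCfg := config{Engine: "chrome", Proxy: "http://127.0.0.1:8080"}
	fmt.Printf("%+v\n", overlay(fileCfg, os.Getenv))
}
```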

type Engine added in v0.7.0

type Engine string

Engine selects which browser backs scraper imports.

const (
	EngineAuto   Engine = "auto"   // use Cloak if a binary is found, else Chrome
	EngineChrome Engine = "chrome" // system Chrome/Chromium via chromedp
	EngineCloak  Engine = "cloak"  // CloakBrowser patched Chromium binary
	EngineCDP    Engine = "cdp"    // attach to an existing browser (cloakserve, etc.)
)
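
The comment on EngineAuto implies a small resolution step before launch. A minimal sketch of that step follows. The `resolve` function is hypothetical, the Engine type and constants are restated locally so the example compiles on its own, and treating an empty engine as auto is an assumption.

```go
package main

import "fmt"

// Engine and its constants restate the documented declarations.
type Engine string

const (
	EngineAuto   Engine = "auto"
	EngineChrome Engine = "chrome"
	EngineCloak  Engine = "cloak"
	EngineCDP    Engine = "cdp"
)

// resolve turns the configured engine into the one that would be launched.
// Explicit choices pass through; auto picks Cloak when a binary was found
// and falls back to Chrome otherwise.
func resolve(configured Engine, cloakFound bool) Engine {
	if configured != EngineAuto && configured != "" {
		return configured
	}
	if cloakFound {
		return EngineCloak
	}
	return EngineChrome
}

func main() {
	fmt.Println(resolve(EngineAuto, true))
	fmt.Println(resolve(EngineAuto, false))
	fmt.Println(resolve(EngineCDP, true))
}
```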

type Result

type Result struct {
	Jobs       []job.Job
	Discovered int
}

Result is what one scrape produces. Discovered is the raw count seen in the HTML before any filtering. The caller can compare it to len(Jobs) to detect parsing problems: for example, 25 cards visible but only 3 with usable URLs suggests the selectors are stale.
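
The comparison between Discovered and len(Jobs) can be turned into a simple check. The `looksStale` helper and the 0.5 threshold below are hypothetical; choose a ratio that fits how much filtering each site's parser normally does.

```go
package main

import "fmt"

// looksStale reports whether too few of the discovered cards survived
// parsing. minRatio is the smallest acceptable parsed/discovered ratio.
func looksStale(discovered, parsed int, minRatio float64) bool {
	if discovered == 0 {
		// Nothing was seen at all. An empty search and a broken selector
		// look identical here, so this check cannot decide.
		return false
	}
	return float64(parsed)/float64(discovered) < minRatio
}

func main() {
	// The case from the description: 25 cards seen, 3 usable.
	fmt.Println(looksStale(25, 3, 0.5)) // prints true
	fmt.Println(looksStale(25, 24, 0.5)) // prints false
}
```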

func Scrape

func Scrape(b *Browser, src source.Source) (Result, error)

Scrape fetches and parses one scraper-backed source. The caller provides the Browser: either a fresh one per source or a shared one for an `import --all` run. Use NewBrowser to construct one.

type SiteParser

type SiteParser interface {
	Name() string
	// WaitSelector is the CSS selector the headless browser will wait
	// for before reading the DOM. Empty means "no wait — just sleep
	// after navigate, then read."
	WaitSelector() string
	Parse(htmlText string, src source.Source) (jobs []job.Job, discovered int, err error)
}

SiteParser converts a search-page HTML string into normalized jobs. Each per-site file in this package implements one.
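
To show the shape of an implementation, here is a sketch of a parser for an invented page layout. `Job` and `Source` are minimal stand-ins for job.Job and source.Source, whose fields are not shown on this page. The `job-card` markup and `exampleParser` are hypothetical, and the regular expression is a shortcut for brevity; a real per-site parser would more likely walk the DOM with an HTML parser.

```go
package main

import (
	"fmt"
	"regexp"
)

// Job and Source are stand-ins for job.Job and source.Source.
type Job struct{ Title, URL string }
type Source struct{ Name string }

// SiteParser restates the documented interface using the stand-in types.
type SiteParser interface {
	Name() string
	WaitSelector() string
	Parse(htmlText string, src Source) (jobs []Job, discovered int, err error)
}

// cardRe matches a hypothetical card: <a class="job-card" href="...">Title</a>.
// The href attribute is optional so that cards without a link still count.
var cardRe = regexp.MustCompile(`<a class="job-card"(?: href="([^"]*)")?>([^<]*)</a>`)

type exampleParser struct{}

// Compile-time check that exampleParser satisfies the interface.
var _ SiteParser = exampleParser{}

func (exampleParser) Name() string         { return "example" }
func (exampleParser) WaitSelector() string { return "a.job-card" }

// Parse counts every card it sees as discovered, but only returns
// cards that carry a usable URL.
func (exampleParser) Parse(htmlText string, src Source) ([]Job, int, error) {
	matches := cardRe.FindAllStringSubmatch(htmlText, -1)
	var jobs []Job
	for _, m := range matches {
		if m[1] == "" {
			continue // card seen, but no link to follow
		}
		jobs = append(jobs, Job{Title: m[2], URL: m[1]})
	}
	return jobs, len(matches), nil
}

func main() {
	page := `<a class="job-card" href="/jobs/1">Go Developer</a>` +
		`<a class="job-card">No link</a>`
	jobs, discovered, err := exampleParser{}.Parse(page, Source{Name: "example"})
	fmt.Println(len(jobs), discovered, err)
}
```

Returning the discovered count separately from the job slice is what lets the caller notice when cards are visible but no longer parse.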

type Status added in v0.7.0

type Status struct {
	ConfigEngine   Engine
	ResolvedEngine Engine
	CloakInstalled bool
	CloakPath      string
	PythonPath     string
	ChromeOnPath   bool
	Ready          bool
	Hint           string
}

Status summarizes scraper readiness for status/doctor output.

func CheckStatus added in v0.7.0

func CheckStatus(file Config) Status

CheckStatus inspects config and the local machine without launching a browser.
