scraper

package

Version: v0.7.1 (latest) | Published: May 19, 2026 | License: MIT | Imports: 17 | Imported by: 0

Repository

github.com/0xcryptj/jobforge

Documentation ¶

Overview ¶

Package scraper drives headless Chrome (via chromedp) to pull job listings from sites that don't expose an official API. The host sites (LinkedIn, Indeed, Glassdoor, Handshake) may prohibit automated access in their Terms of Service. This package exists because the user accepted those risks explicitly; it should not be used at any scale that could harm those sites or get the user's IP / account flagged.

Selectors will drift. When something breaks, run the scrape with `--debug-html` (TODO: not yet implemented) to dump the page HTML, then update the per-site parser. Treat this package as load-bearing but fragile.

Index ¶

Constants
func FindChromeExecutable() (string, error)
func FindPython() (string, error)
func HostOf(url string) string
func Install(ctx context.Context, out io.Writer) (string, error)
func SleepCtx(ctx context.Context, d time.Duration)
type Browser
- func NewBrowser(parent context.Context, cfg Config) (*Browser, error)
- func (b *Browser) Close()
- func (b *Browser) Engine() Engine
type Config
- func FromApp(engine, cdpURL, binaryPath, proxy string, fingerprintSeed int) Config
- func ResolveConfig(file Config) Config
type Engine
type Result
- func Scrape(b *Browser, src source.Source) (Result, error)
type SiteParser
type Status
- func CheckStatus(file Config) Status

Constants ¶

const PoliteDelay = 1500 * time.Millisecond

PoliteDelay is how long callers should wait between consecutive scrapes against the same host. The value is a compromise: too low and Cloudflare rate-limits us; too high and `import --all` becomes painfully slow.

Functions ¶

func FindChromeExecutable ¶ added in v0.7.0

func FindChromeExecutable() (string, error)

FindChromeExecutable returns the path to a Chromium-based browser for scraping.

func FindPython ¶ added in v0.7.0

func FindPython() (string, error)

FindPython returns a Python executable suitable for pip (python3, python, or py -3).
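
The candidate order named above (python3, python, then py -3) can be sketched as a simple PATH probe. This is only an illustration under stated assumptions, not the package's implementation: findPython, its injected lookPath parameter, and errNotFound are hypothetical names, and the sketch returns an argument vector where the real FindPython returns a single string.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// errNotFound is a hypothetical sentinel used by this sketch only.
var errNotFound = errors.New("no Python interpreter found on PATH")

// findPython tries python3, python, then the Windows launcher "py -3".
// lookPath is injected so the probe can be exercised without touching PATH;
// pass exec.LookPath in real use. The result is the command as an argv slice.
func findPython(lookPath func(string) (string, error)) ([]string, error) {
	candidates := [][]string{{"python3"}, {"python"}, {"py", "-3"}}
	for _, c := range candidates {
		path, err := lookPath(c[0])
		if err != nil {
			continue // not installed under this name; try the next one
		}
		return append([]string{path}, c[1:]...), nil
	}
	return nil, errNotFound
}

func main() {
	argv, err := findPython(exec.LookPath)
	if err != nil {
		fmt.Println("python not found:", err)
		return
	}
	fmt.Println("python command:", argv)
}
```

Injecting the lookup function keeps the probe testable on machines where none of the three names exist.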

func HostOf ¶

func HostOf(url string) string

HostOf extracts the bare hostname from a URL for per-host throttling and logging. Callers should keep a map[string]time.Time keyed by host, recording the last scrape time, to enforce PoliteDelay across multiple scrapes.
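
The per-host bookkeeping described above might look like the following sketch. It is self-contained and does not import this package: hostOf is a hypothetical stand-in for HostOf built on net/url (the real function may normalize hosts differently), politeDelay copies the documented constant, and the throttle type is an assumption made for illustration.

```go
package main

import (
	"fmt"
	"net/url"
	"time"
)

// politeDelay mirrors the documented PoliteDelay constant.
const politeDelay = 1500 * time.Millisecond

// hostOf is a hypothetical stand-in for HostOf: it returns the hostname
// without the port, or "" when the URL cannot be parsed.
func hostOf(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return ""
	}
	return u.Hostname()
}

// throttle remembers when each host was last scraped.
type throttle struct {
	last map[string]time.Time
}

// wait reports how much of politeDelay is still owed for host at time now.
func (t *throttle) wait(host string, now time.Time) time.Duration {
	prev, seen := t.last[host]
	if !seen {
		return 0
	}
	if remaining := politeDelay - now.Sub(prev); remaining > 0 {
		return remaining
	}
	return 0
}

// mark records that host was scraped at time now.
func (t *throttle) mark(host string, now time.Time) {
	if t.last == nil {
		t.last = make(map[string]time.Time)
	}
	t.last[host] = now
}

func main() {
	th := &throttle{}
	start := time.Now()
	host := hostOf("https://www.indeed.com/jobs?q=golang")
	th.mark(host, start)
	// Half a second later, part of the delay is still owed.
	fmt.Println(host, th.wait(host, start.Add(500*time.Millisecond)))
}
```

A caller would sleep for the duration returned by wait, run the scrape, then call mark.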

func Install ¶ added in v0.7.0

func Install(ctx context.Context, out io.Writer) (string, error)

Install downloads CloakBrowser via pip and runs `python -m cloakbrowser install`.

func SleepCtx ¶

func SleepCtx(ctx context.Context, d time.Duration)

SleepCtx pauses for d unless ctx cancels first. Used between scrapes to enforce PoliteDelay while still respecting Ctrl+C.
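
A cancellable sleep of this kind is commonly written with a timer and a select. The sketch below assumes that shape; sleepCtx and cancelledSleep are hypothetical names, and the real SleepCtx may be implemented differently.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// sleepCtx blocks for d, or returns early when ctx is cancelled.
func sleepCtx(ctx context.Context, d time.Duration) {
	timer := time.NewTimer(d)
	defer timer.Stop() // release the timer if we return early
	select {
	case <-ctx.Done(): // Ctrl+C or a parent timeout
	case <-timer.C: // the full delay elapsed
	}
}

// cancelledSleep demonstrates the early-return path: the context is
// already cancelled, so a 5s sleep returns almost immediately.
func cancelledSleep() time.Duration {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	start := time.Now()
	sleepCtx(ctx, 5*time.Second)
	return time.Since(start)
}

func main() {
	fmt.Println("returned early:", cancelledSleep() < time.Second)
}
```

Compared with a plain time.Sleep, this lets an interrupted `import --all` run stop between scrapes instead of waiting out the remaining delay.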

Types ¶

type Browser ¶

type Browser struct {
	// contains filtered or unexported fields
}

Browser owns one chromedp allocator + tab pair. Reuse it across many FetchPage calls in a single import run; close it when done.

func NewBrowser ¶

func NewBrowser(parent context.Context, cfg Config) (*Browser, error)

NewBrowser launches a browser for scraping using cfg. When cfg is the zero value, NewBrowser applies ResolveConfig to an empty file config, so only environment overrides and the auto engine take effect.

func (*Browser) Close ¶

func (b *Browser) Close()

Close releases the underlying Chrome process and allocator. Safe to call multiple times.
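
One common way to make a Close method safe to call repeatedly is sync.Once. The sketch below shows that pattern only; the browser type and its release field are hypothetical stand-ins, and nothing here is confirmed to match how Browser is implemented.

```go
package main

import (
	"fmt"
	"sync"
)

// browser is a hypothetical stand-in for Browser. release represents
// whatever tears down the Chrome process and allocator.
type browser struct {
	closeOnce sync.Once
	release   func()
}

// Close runs release at most once, so repeated or deferred calls are harmless.
func (b *browser) Close() {
	b.closeOnce.Do(func() {
		if b.release != nil {
			b.release()
		}
	})
}

func main() {
	released := 0
	b := &browser{release: func() { released++ }}
	defer b.Close() // safe even though Close is also called explicitly below
	b.Close()
	b.Close()
	fmt.Println("release ran", released, "time(s)")
}
```

This is what allows a caller to combine `defer b.Close()` with an explicit Close on an error path.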

func (*Browser) Engine ¶ added in v0.7.0

func (b *Browser) Engine() Engine

Engine reports which backend this browser is using.

type Config ¶ added in v0.7.0

type Config struct {
	Engine          Engine
	CDPURL          string
	BinaryPath      string
	Proxy           string
	FingerprintSeed int // 0 = random per launch (Cloak default)
}

Config controls how scraper-backed sources launch a browser.

func FromApp ¶ added in v0.7.0

func FromApp(engine, cdpURL, binaryPath, proxy string, fingerprintSeed int) Config

FromApp maps config.json scraper settings into a scraper Config.

func ResolveConfig ¶ added in v0.7.0

func ResolveConfig(file Config) Config

ResolveConfig merges optional file settings with environment overrides. Env wins when set: JOBFORGE_SCRAPER_ENGINE, JOBFORGE_CDP_URL, CLOAKBROWSER_BINARY_PATH, JOBFORGE_SCRAPER_PROXY, JOBFORGE_SCRAPER_FINGERPRINT.
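
The "env wins when set" rule can be illustrated with a small overlay function. This is a sketch, not the package's code: config and resolve are hypothetical names, Engine is simplified to a string, and the handling of empty or unparsable values (ignored here) is an assumption the documentation does not confirm. The environment variable names are the ones listed above.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// config is a hypothetical mirror of the documented Config fields.
type config struct {
	Engine, CDPURL, BinaryPath, Proxy string
	FingerprintSeed                   int
}

// resolve overlays environment values on file settings. lookup is injected
// (os.LookupEnv in real use). Assumption: empty or unparsable values are
// ignored rather than treated as overrides.
func resolve(file config, lookup func(string) (string, bool)) config {
	out := file
	override := func(key string, dst *string) {
		if v, ok := lookup(key); ok && v != "" {
			*dst = v
		}
	}
	override("JOBFORGE_SCRAPER_ENGINE", &out.Engine)
	override("JOBFORGE_CDP_URL", &out.CDPURL)
	override("CLOAKBROWSER_BINARY_PATH", &out.BinaryPath)
	override("JOBFORGE_SCRAPER_PROXY", &out.Proxy)
	if v, ok := lookup("JOBFORGE_SCRAPER_FINGERPRINT"); ok {
		if seed, err := strconv.Atoi(v); err == nil {
			out.FingerprintSeed = seed
		}
	}
	return out
}

func main() {
	file := config{Engine: "chrome", Proxy: "http://127.0.0.1:8080"}
	fmt.Printf("%+v\n", resolve(file, os.LookupEnv))
}
```

Fields without a matching environment variable keep their file values, which is why the proxy above survives unless JOBFORGE_SCRAPER_PROXY is set.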

type Engine ¶ added in v0.7.0

type Engine string

Engine selects which browser backs scraper imports.

const (
	EngineAuto   Engine = "auto"   // use Cloak if a binary is found, else Chrome
	EngineChrome Engine = "chrome" // system Chrome/Chromium via chromedp
	EngineCloak  Engine = "cloak"  // CloakBrowser patched Chromium binary
	EngineCDP    Engine = "cdp"    // attach to an existing browser (cloakserve, etc.)
)
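
The comment on EngineAuto implies a small resolution step. A minimal sketch, assuming the only input is whether a CloakBrowser binary path was found: resolveEngine and the lowercase names are hypothetical, and treating an empty engine like "auto" is an assumption.

```go
package main

import "fmt"

type engine string

const (
	engineAuto   engine = "auto"
	engineChrome engine = "chrome"
	engineCloak  engine = "cloak"
	engineCDP    engine = "cdp"
)

// resolveEngine turns "auto" into a concrete backend. cloakPath is the
// path of a discovered CloakBrowser binary, or "" when none was found.
// Assumption: an empty engine is treated like "auto".
func resolveEngine(e engine, cloakPath string) engine {
	if e != engineAuto && e != "" {
		return e // explicit choices are respected as-is
	}
	if cloakPath != "" {
		return engineCloak
	}
	return engineChrome
}

func main() {
	fmt.Println(resolveEngine(engineAuto, ""))
	fmt.Println(resolveEngine(engineAuto, "/opt/cloak/chrome"))
	fmt.Println(resolveEngine(engineCDP, "/opt/cloak/chrome"))
}
```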

type Result ¶

type Result struct {
	Jobs       []job.Job
	Discovered int
}

Result is what one scrape produces. Discovered is the raw count seen in the HTML before any filtering — the caller can compare it to len(Jobs) to detect parsing problems (e.g., 25 cards visible but only 3 with usable URLs ⇒ selectors are stale).
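
The comparison between Discovered and len(Jobs) can be turned into a simple health check. In this sketch, result is a hypothetical mirror of Result with Jobs simplified to strings, and minRatio is an illustrative threshold rather than a value taken from the package.

```go
package main

import "fmt"

// result is a hypothetical mirror of Result; Jobs is simplified to URLs.
type result struct {
	Jobs       []string
	Discovered int
}

// looksStale reports whether too few discovered cards became usable jobs.
// minRatio is an illustrative threshold, not a value from the package.
func looksStale(r result, minRatio float64) bool {
	if r.Discovered == 0 {
		return false // nothing was seen, so there is no ratio to judge
	}
	return float64(len(r.Jobs))/float64(r.Discovered) < minRatio
}

func main() {
	// The example from the text: 25 cards visible, only 3 usable.
	r := result{Jobs: make([]string, 3), Discovered: 25}
	fmt.Println("stale selectors suspected:", looksStale(r, 0.5))
}
```

A Discovered count of zero is left to the caller, since it can mean an empty search as easily as a broken WaitSelector.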

func Scrape ¶

func Scrape(b *Browser, src source.Source) (Result, error)

Scrape fetches and parses one scraper-backed source. The caller provides the Browser (a fresh one per source, or one shared across an `import --all` run); construct it with NewBrowser.

type SiteParser ¶

type SiteParser interface {
	Name() string
	// WaitSelector is the CSS selector the headless browser will wait
	// for before reading the DOM. Empty means "no wait — just sleep
	// after navigate, then read."
	WaitSelector() string
	Parse(htmlText string, src source.Source) (jobs []job.Job, discovered int, err error)
}

SiteParser converts a search-page HTML string into normalized jobs. Each per-site file in this package implements one.
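
To make the interface contract concrete, here is a toy parser. Everything in it is an assumption for illustration: Job and Source stand in for job.Job and source.Source, the job-card markup is invented, and a regular expression is used only to keep the example free of dependencies (a real parser would normally use an HTML parser).

```go
package main

import (
	"fmt"
	"regexp"
)

// Job and Source are hypothetical stand-ins for job.Job and source.Source.
type Job struct{ Title, URL string }
type Source struct{ Name string }

// siteParser repeats the documented interface against the stand-in types.
type siteParser interface {
	Name() string
	WaitSelector() string
	Parse(htmlText string, src Source) (jobs []Job, discovered int, err error)
}

// cardRe matches an invented card markup: <a class="job-card" href="...">Title</a>.
// The href attribute is optional so that link-less cards are still counted.
var cardRe = regexp.MustCompile(`<a class="job-card"(?: href="([^"]*)")?>([^<]*)</a>`)

type demoParser struct{}

func (demoParser) Name() string { return "demo" }

func (demoParser) WaitSelector() string { return "a.job-card" }

// Parse counts every card as discovered but keeps only cards with a URL.
func (demoParser) Parse(htmlText string, src Source) ([]Job, int, error) {
	matches := cardRe.FindAllStringSubmatch(htmlText, -1)
	var jobs []Job
	for _, m := range matches {
		if m[1] == "" {
			continue // card without a usable link
		}
		jobs = append(jobs, Job{Title: m[2], URL: m[1]})
	}
	return jobs, len(matches), nil
}

func main() {
	var p siteParser = demoParser{}
	page := `<a class="job-card" href="/jobs/1">Go Developer</a><a class="job-card">No link</a>`
	jobs, discovered, err := p.Parse(page, Source{Name: "demo"})
	fmt.Println(len(jobs), discovered, err)
}
```

The gap between the two counts returned by Parse is what feeds the Discovered field of Result.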

type Status ¶ added in v0.7.0

type Status struct {
	ConfigEngine   Engine
	ResolvedEngine Engine
	CloakInstalled bool
	CloakPath      string
	PythonPath     string
	ChromeOnPath   bool
	Ready          bool
	Hint           string
}

Status summarizes scraper readiness for status/doctor output.

func CheckStatus ¶ added in v0.7.0

func CheckStatus(file Config) Status

CheckStatus inspects config and the local machine without launching a browser.
