scraper

package

v0.6.0 (latest) | Published: May 17, 2026 | License: MIT | Imports: 8 | Imported by: 0


Repository

github.com/0xcryptj/jobforge


Documentation ¶

Overview ¶

Package scraper drives headless Chrome (via chromedp) to pull job listings from sites that don't expose an official API. The host sites (LinkedIn, Indeed, Glassdoor) prohibit automated access in their Terms of Service. This package exists because the user accepted those risks explicitly; it should not be used at any scale that could harm those sites or get the user's IP / account flagged.

Selectors will drift. When something breaks, run the scrape with `--debug-html` (TODO: not yet implemented) to dump the page HTML, then update the per-site parser. Treat this package as load-bearing but fragile.

Index ¶

Constants
func HostOf(url string) string
func SleepCtx(ctx context.Context, d time.Duration)
type Browser
- func NewBrowser(parent context.Context) (*Browser, error)
- func (b *Browser) Close()
type Result
- func Scrape(b *Browser, src source.Source) (Result, error)
type SiteParser

Constants ¶


const PoliteDelay = 1500 * time.Millisecond

PoliteDelay is how long callers should wait between consecutive scrapes against the same host. 1.5s is a deliberately conservative compromise: any lower and Cloudflare rate-limits us; much higher and `import --all` becomes painfully slow.


Functions ¶

func HostOf ¶

func HostOf(url string) string

HostOf extracts the bare hostname from a URL for per-host throttling and logging. Callers should keep a map[string]time.Time, keyed by host, that records the last scrape time so PoliteDelay can be enforced across multiple scrapes.

func SleepCtx ¶

func SleepCtx(ctx context.Context, d time.Duration)

SleepCtx pauses for d unless ctx cancels first. Used between scrapes to enforce PoliteDelay while still respecting Ctrl+C.

Types ¶

type Browser ¶

type Browser struct {
	// contains filtered or unexported fields
}

Browser owns one chromedp allocator and tab pair. Reuse it across many Scrape calls in a single import run, and close it when done.

func NewBrowser ¶

func NewBrowser(parent context.Context) (*Browser, error)

NewBrowser launches headless Chrome with a desktop-shaped viewport and a real-looking User-Agent. The process actually starts on the first chromedp Run() call; if Chrome isn't installed, NewBrowser returns a clear error.

func (*Browser) Close ¶

func (b *Browser) Close()

Close releases the underlying Chrome process and allocator. Safe to call multiple times.

type Result ¶

type Result struct {
	Jobs       []job.Job
	Discovered int
}

Result is what one scrape produces. Discovered is the raw count seen in the HTML before any filtering — the caller can compare it to len(Jobs) to detect parsing problems (e.g., 25 cards visible but only 3 with usable URLs ⇒ selectors are stale).
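The comparison between Discovered and len(Jobs) might be automated along these lines. This is a sketch only: `Job` and `Result` are local stand-ins for the real types, and `looksStale` and its `minRatio` threshold are hypothetical helpers, not part of this package.

```go
package main

import "fmt"

// Job and Result are local stand-ins for job.Job and scraper.Result.
type Job struct {
	Title string
	URL   string
}

type Result struct {
	Jobs       []Job
	Discovered int
}

// looksStale reports whether fewer than minRatio of the discovered cards
// were turned into usable jobs, which hints that selectors have drifted.
// A result with Discovered == 0 is not flagged here; an empty page needs
// a separate check because the counts alone cannot explain it.
func looksStale(r Result, minRatio float64) bool {
	if r.Discovered == 0 {
		return false
	}
	return float64(len(r.Jobs)) < minRatio*float64(r.Discovered)
}

func main() {
	// The example from the text: 25 cards visible, only 3 usable.
	r := Result{Jobs: make([]Job, 3), Discovered: 25}
	if looksStale(r, 0.5) {
		fmt.Printf("parsed %d of %d cards; selectors may be stale\n", len(r.Jobs), r.Discovered)
	}
}
```

The 0.5 threshold is arbitrary; a caller that filters jobs for legitimate reasons would want a lower one.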

func Scrape ¶

func Scrape(b *Browser, src source.Source) (Result, error)

Scrape fetches and parses one scraper-backed source. The caller supplies the Browser, either a fresh one per source or one shared across an `import --all` run. Construct it with NewBrowser.

type SiteParser ¶

type SiteParser interface {
	Name() string
	// WaitSelector is the CSS selector the headless browser will wait
	// for before reading the DOM. Empty means "no wait — just sleep
	// after navigate, then read."
	WaitSelector() string
	Parse(htmlText string, src source.Source) (jobs []job.Job, discovered int, err error)
}

SiteParser converts a search-page HTML string into normalized jobs. Each per-site file in this package implements one.
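As an illustration of the interface's shape, here is a toy parser for an invented site. Everything in it is an assumption: `Job` and `Source` are simplified stand-ins for job.Job and source.Source, the `job-card` markup and `exampleParser` are made up, and a regular expression is used only to keep the sketch within the standard library (a real parser would more likely walk the DOM).

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Job and Source are simplified stand-ins for job.Job and source.Source.
type Job struct {
	Title  string
	URL    string
	Source string
}

type Source struct {
	Name string
}

// SiteParser repeats the interface documented above, using the stand-ins.
type SiteParser interface {
	Name() string
	WaitSelector() string
	Parse(htmlText string, src Source) (jobs []Job, discovered int, err error)
}

// exampleParser handles a made-up site whose cards look like
// <a class="job-card" href="...">Title</a>.
type exampleParser struct{}

// Compile-time check that exampleParser satisfies SiteParser.
var _ SiteParser = exampleParser{}

// cardRe matches one card; the href attribute is optional so that cards
// without a link are still counted as discovered.
var cardRe = regexp.MustCompile(`<a class="job-card"(?: href="([^"]*)")?>([^<]*)</a>`)

func (exampleParser) Name() string { return "example" }

func (exampleParser) WaitSelector() string { return "a.job-card" }

func (exampleParser) Parse(htmlText string, src Source) ([]Job, int, error) {
	matches := cardRe.FindAllStringSubmatch(htmlText, -1)
	var jobs []Job
	for _, m := range matches {
		if m[1] == "" {
			continue // card seen, but no usable URL
		}
		jobs = append(jobs, Job{
			Title:  strings.TrimSpace(m[2]),
			URL:    m[1],
			Source: src.Name,
		})
	}
	return jobs, len(matches), nil
}

func main() {
	page := `<a class="job-card" href="/jobs/1">Go Developer</a>` +
		`<a class="job-card">No link</a>`
	jobs, discovered, err := exampleParser{}.Parse(page, Source{Name: "example"})
	fmt.Println(len(jobs), discovered, err)
}
```

Counting every matched card as discovered, even those skipped for lacking a URL, is what makes the Discovered versus len(Jobs) comparison meaningful.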
