scraper

package
v0.6.0
Published: May 17, 2026 License: MIT Imports: 8 Imported by: 0

Documentation

Overview

Package scraper drives headless Chrome (via chromedp) to pull job listings from sites that don't expose an official API. The host sites (LinkedIn, Indeed, Glassdoor) prohibit automated access in their Terms of Service. This package exists because the user accepted those risks explicitly; it should not be used at any scale that could harm those sites or get the user's IP / account flagged.

Selectors will drift. When something breaks, run the scrape with `--debug-html` (TODO) to dump the page HTML and update the per-site parser. Treat this package as load-bearing-but-fragile.

Index

Constants

const PoliteDelay = 1500 * time.Millisecond

PoliteDelay is how long callers should wait between consecutive scrapes against the same host. The number is conservative — too low and Cloudflare rate-limits us; too high and `import --all` becomes painfully slow. 1.5s is a compromise.

Variables

This section is empty.

Functions

func HostOf

func HostOf(url string) string

HostOf extracts the bare hostname from a URL for per-host throttling and logging. Callers should keep a map[string]time.Time, keyed by host and holding the time of the last scrape, to enforce PoliteDelay across multiple scrapes.
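The per-host bookkeeping described above could be sketched roughly as follows. This is an illustration, not code from the package: hostOf is a hypothetical stand-in for HostOf built on net/url (the real function may behave differently on malformed input), and the throttle type and its reserve method are invented names.

```go
package main

import (
	"net/url"
	"time"
)

// politeDelay mirrors the package's PoliteDelay constant.
const politeDelay = 1500 * time.Millisecond

// hostOf is a hypothetical stand-in for HostOf. It returns the hostname
// without port, or "" when the URL cannot be parsed.
func hostOf(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return ""
	}
	return u.Hostname()
}

// throttle remembers when each host was (or will be) last scraped.
type throttle struct {
	last map[string]time.Time
}

func newThrottle() *throttle {
	return &throttle{last: make(map[string]time.Time)}
}

// reserve returns how long the caller should sleep before scraping host,
// and records the moment at which that scrape is expected to start.
func (t *throttle) reserve(host string, now time.Time) time.Duration {
	var wait time.Duration
	if prev, ok := t.last[host]; ok {
		if next := prev.Add(politeDelay); next.After(now) {
			wait = next.Sub(now)
		}
	}
	t.last[host] = now.Add(wait)
	return wait
}
```

A caller would pass the returned duration to SleepCtx before each scrape, for example `SleepCtx(ctx, th.reserve(hostOf(u), time.Now()))`. Passing the current time in as an argument keeps the bookkeeping easy to test without real sleeps.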

func SleepCtx

func SleepCtx(ctx context.Context, d time.Duration)

SleepCtx pauses for d, or until ctx is cancelled, whichever comes first. It is used between scrapes to enforce PoliteDelay while still respecting Ctrl+C.
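A context-aware sleep with this signature is commonly written with a timer and a select. The sketch below is one such pattern and makes no claim about how SleepCtx itself is implemented; sleepCtx and demoCancelledSleep are illustrative names.

```go
package main

import (
	"context"
	"time"
)

// sleepCtx blocks until d has elapsed or ctx is done, whichever is first.
// Like SleepCtx it returns nothing, so callers that need to know why it
// returned should check ctx.Err() afterwards.
func sleepCtx(ctx context.Context, d time.Duration) {
	timer := time.NewTimer(d)
	defer timer.Stop() // release the timer if ctx wins the race
	select {
	case <-ctx.Done():
	case <-timer.C:
	}
}

// demoCancelledSleep asks for a one-hour sleep on an already-cancelled
// context and reports whether the call came back in under a second.
func demoCancelledSleep() bool {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	start := time.Now()
	sleepCtx(ctx, time.Hour)
	return time.Since(start) < time.Second
}
```

Using time.NewTimer with a deferred Stop, rather than time.After, avoids keeping the timer alive for the full duration when the context is cancelled early.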

Types

type Browser

type Browser struct {
	// contains filtered or unexported fields
}

Browser owns one chromedp allocator + tab pair. Reuse it across many FetchPage calls in a single import run; close it when done.

func NewBrowser

func NewBrowser(parent context.Context) (*Browser, error)

NewBrowser launches headless Chrome with a desktop-shaped viewport and a real-looking User-Agent. The Chrome process is only started by the first Run() call; if Chrome isn't installed, NewBrowser returns a clear error.

func (*Browser) Close

func (b *Browser) Close()

Close releases the underlying Chrome process and allocator. Safe to call multiple times.
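One common way to make a Close method safe to call multiple times is sync.Once. The sketch below illustrates that pattern only; the browser type and its cancel field are hypothetical stand-ins, and the real Browser may achieve idempotence differently.

```go
package main

import "sync"

// browser is a simplified stand-in for Browser. The cancel field stands
// in for whatever teardown the real type performs (for example, the
// cancel functions of its chromedp contexts).
type browser struct {
	closeOnce sync.Once
	cancel    func()
}

// Close runs the teardown at most once, however many times it is called.
func (b *browser) Close() {
	b.closeOnce.Do(func() {
		if b.cancel != nil {
			b.cancel()
		}
	})
}
```

With this shape, a caller can both `defer b.Close()` and call Close explicitly on an error path without double-releasing anything.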

type Result

type Result struct {
	Jobs       []job.Job
	Discovered int
}

Result is what one scrape produces. Discovered is the raw count seen in the HTML before any filtering — the caller can compare it to len(Jobs) to detect parsing problems (e.g., 25 cards visible but only 3 with usable URLs ⇒ selectors are stale).
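The comparison between Discovered and len(Jobs) could be turned into a simple check like the one below. It is a sketch under assumptions: result simplifies Result by using strings in place of job.Job, and both looksStale and its minRatio threshold are hypothetical, not part of the package.

```go
package main

// result mirrors the shape of Result, with job URLs standing in for the
// real job.Job values.
type result struct {
	Jobs       []string
	Discovered int
}

// looksStale reports whether the share of discovered cards that turned
// into usable jobs fell below minRatio (a caller-chosen threshold).
func looksStale(r result, minRatio float64) bool {
	if r.Discovered == 0 {
		// Nothing was seen at all. That may be an empty search or a
		// broken wait selector, which this ratio cannot distinguish.
		return false
	}
	return float64(len(r.Jobs))/float64(r.Discovered) < minRatio
}
```

With the figures from the description, 3 jobs out of 25 discovered cards gives a ratio of 0.12, so any threshold above that (say 0.5) would flag the parser for review.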

func Scrape

func Scrape(b *Browser, src source.Source) (Result, error)

Scrape fetches and parses one scraper-backed source. The caller provides the Browser (a fresh one per source, or a shared one for an `import --all` run); construct it with NewBrowser.

type SiteParser

type SiteParser interface {
	Name() string
	// WaitSelector is the CSS selector the headless browser will wait
	// for before reading the DOM. Empty means "no wait — just sleep
	// after navigate, then read."
	WaitSelector() string
	Parse(htmlText string, src source.Source) (jobs []job.Job, discovered int, err error)
}

SiteParser converts a search-page HTML string into normalized jobs. Each per-site file in this package implements one.
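To make the contract concrete, here is a minimal sketch of a parser for a made-up site. Everything site-specific is an assumption: the markup (`<li class="job-card">` wrapping an `<a href>`), the job and source stand-in types, and the use of regular expressions, which are only adequate for a toy example. It does show the intended relationship between the two return values: every card counts toward discovered, but only cards with a usable URL become jobs.

```go
package main

import (
	"regexp"
	"strings"
)

// job and source are simplified stand-ins for job.Job and source.Source,
// whose real fields are not shown in this documentation.
type job struct{ Title, URL string }
type source struct{ Name string }

// exampleParser targets a hypothetical site whose result cards look like
// <li class="job-card"><a href="...">Title</a></li>.
type exampleParser struct{}

var (
	cardRe = regexp.MustCompile(`(?s)<li class="job-card">(.*?)</li>`)
	linkRe = regexp.MustCompile(`(?s)<a href="([^"]+)">(.*?)</a>`)
)

func (exampleParser) Name() string         { return "example" }
func (exampleParser) WaitSelector() string { return "li.job-card" }

// Parse counts every card as discovered and keeps only those with a link.
func (exampleParser) Parse(htmlText string, src source) ([]job, int, error) {
	cards := cardRe.FindAllStringSubmatch(htmlText, -1)
	var jobs []job
	for _, card := range cards {
		m := linkRe.FindStringSubmatch(card[1])
		if m == nil {
			continue // discovered, but no usable URL
		}
		jobs = append(jobs, job{Title: strings.TrimSpace(m[2]), URL: m[1]})
	}
	return jobs, len(cards), nil
}
```

A real per-site parser would more likely walk the DOM with an HTML parsing library than rely on regular expressions, since the live sites' markup is far less regular than this example.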
