scraper

Published: Sep 4, 2025 License: Apache-2.0 Imports: 17 Imported by: 0


A polite, concurrent web scraper you can use as a Go library or as a CLI.

  • Per-host rate limiting (RPS)
  • robots.txt compliance (cached per host)
  • Depth-limited crawling; same-host restriction
  • Include/Exclude regex for link enqueues
  • Retries with exponential backoff
  • Extracts <title>, meta/OG tags, visible text or custom CSS selectors
  • Streams JSON Lines (one page per line) from the CLI
  • Optional raw HTML snapshots to disk

Install

Choose a module path (examples use github.com/makebooks-ai/scraper).

go get github.com/makebooks-ai/scraper@latest

Local development from another repo? Add a replace in the consumer’s go.mod:

replace github.com/makebooks-ai/scraper => ../path/to/scraper

Import the package and stream results from Crawl. The API is intentionally small:

import (
  "context"
  "fmt"
  "net/url"
  "time"

  "github.com/makebooks-ai/scraper"
)

func Example() error {
  seed, err := url.Parse("https://example.com")
  if err != nil {
    return err
  }

  s := scraper.New(scraper.Config{
    Seed:          seed,
    DepthLimit:    1,
    SameHostOnly:  true,
    RespectRobots: true,
    NoFollow:      false,
    MaxPages:      0,           // unlimited
    RPS:           1.0,         // per-host
    Workers:       4,
    Timeout:       15 * time.Second,
    MaxRetries:    2,
    BackoffBase:   700 * time.Millisecond,
    UserAgent:     "scraper (+https://example.com)",
    // Optional:
    // Selectors: []string{"article h1", "article p"},
    // IncludeRe: regexp.MustCompile(`^https://example\.com/blog`),
    // ExcludeRe: regexp.MustCompile(`/page/`),
    // SaveHTMLDir: "./snapshots",
  })

  ch, err := s.Crawl(context.Background())
  if err != nil {
    return err
  }

  for rec := range ch {
    if rec.Error != "" {
      fmt.Println("ERR:", rec.URL, rec.Error)
      continue
    }
    fmt.Println(rec.Status, rec.Title, rec.URL)
  }
  return nil
}

scraper.Config

Field          Type             Meaning
Seed           *url.URL         Starting URL (required)
DepthLimit     int              Max link depth (0 = only seed)
SameHostOnly   bool             Only enqueue links on the seed host
RespectRobots  bool             Consult cached per-host robots.txt; drop disallowed paths
NoFollow       bool             Don’t enqueue links; just fetch/process current pages
MaxPages       int              Hard cap on unique pages (0 = unlimited)
RPS            float64          Requests per second, per host
Workers        int              Worker goroutines
Timeout        time.Duration    HTTP request timeout
MaxRetries     int              Retry count for 5xx/temporary errors
BackoffBase    time.Duration    Exponential backoff base delay
UserAgent      string           User-Agent header to send
Selectors      []string         If non-empty, collect text from these CSS selectors; otherwise visible body text
SaveHTMLDir    string           If set, write raw HTML snapshots (hashed filenames)
IncludeRe      *regexp.Regexp   Enqueue only matching links
ExcludeRe      *regexp.Regexp   Skip matching links

scraper.Result
type Result struct {
  URL            string
  Status         int
  FetchedAt      time.Time
  Title          string
  MetaDesc       string
  OGTitle        string
  OGDescription  string
  ContentExcerpt string
  Links          []string
  Error          string // non-empty when fetch/parse blocked or failed
}

CLI

A thin wrapper around the library lives in cli/main.go.

Build

go build -o scraper ./cli

Usage

./scraper -url https://example.com -depth 1 -samehost -workers 8 -rps 1.5 \
  -selectors "article h1,article p" -out pages.jsonl -save_html ./snapshots

Flags

Flag         Default                          Description
-url         (required)                       Seed URL
-depth       1                                Max depth (0 = only seed)
-samehost    false                            Only enqueue same-host links
-workers     4                                Worker goroutines
-rps         1.0                              Per-host requests/sec
-ua          scraper (+https://example.com)   User-Agent header
-timeout     15s                              HTTP timeout
-selectors   ""                               Comma-separated CSS selectors
-include     ""                               Regex; only enqueue links that match
-exclude     ""                               Regex; skip matching links
-out         stdout                           JSONL output path
-save_html   ""                               Directory for raw HTML snapshots
-retries     2                                Max retries for transient errors
-backoff     700ms                            Backoff base delay
-robots      true                             Respect robots.txt
-no_follow   false                            Do not enqueue new links
-max_pages   0                                Limit unique pages (0 = unlimited)

JSONL output schema

Each line is a single Result as JSON:

{"url":"https://example.com/","status":200,"fetched_at":"2025-01-01T12:00:00Z","title":"Home","meta_description":"...","og_title":"...","og_description":"...","content_excerpt":"...","links":["https://example.com/about"],"error":""}

Notes on politeness

  • Keep -rps and -workers modest.
  • Always respect a site’s Terms of Service and robots.txt.
  • Avoid scraping content behind logins or rate limits you don’t control.

Development

scraper/
├─ go.mod
├─ scraper.go     # importable library API
└─ cli/main.go    # CLI wiring to the library

Run tests:

go test ./... -v

Versioning

Follow semantic versioning. For breaking API changes after v1, use Go’s semantic import versioning (a /v2 suffix in the module path).
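Concretely, a v2 release would change both the module path and every import path (paths shown are illustrative):

```
// go.mod of the v2 module
module github.com/makebooks-ai/scraper/v2

// and in consumers:
import "github.com/makebooks-ai/scraper/v2"
```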


License

Apache 2.0. See LICENSE.

Documentation


Types

type Config

type Config struct {
	Seed          *url.URL
	DepthLimit    int
	SameHostOnly  bool
	RespectRobots bool
	NoFollow      bool
	MaxPages      int

	RPS         float64
	Workers     int
	Timeout     time.Duration
	MaxRetries  int
	BackoffBase time.Duration

	UserAgent   string
	Selectors   []string // if empty, falls back to visible text excerpt
	SaveHTMLDir string

	IncludeRe *regexp.Regexp
	ExcludeRe *regexp.Regexp
}

type Result

type Result struct {
	URL            string
	Status         int
	FetchedAt      time.Time
	Title          string
	MetaDesc       string
	OGTitle        string
	OGDescription  string
	ContentExcerpt string
	Links          []string
	Error          string
}

type Scraper

type Scraper struct {
	// contains filtered or unexported fields
}

func New

func New(cfg Config) *Scraper

func (*Scraper) Crawl

func (s *Scraper) Crawl(ctx context.Context) (<-chan Result, error)

Crawl streams page Results on a returned channel and closes it on completion. Callers can range over the channel until it closes.
