scraper

Published: Sep 4, 2025 License: Apache-2.0 Imports: 17 Imported by: 0


A polite, concurrent web scraper you can use as a Go library or as a CLI.

  • Per-host rate limiting (RPS)
  • robots.txt compliance (cached per host)
  • Depth-limited crawling; same-host restriction
  • Include/Exclude regex for link enqueues
  • Retries with exponential backoff
  • Extracts <title>, meta/OG tags, visible text or custom CSS selectors
  • Streams JSON Lines (one page per line) from the CLI
  • Optional raw HTML snapshots to disk

Install

Choose a module path (examples use github.com/makebooks-ai/scraper).

go get github.com/makebooks-ai/scraper@latest

Local development from another repo? Add a replace in the consumer’s go.mod:

replace github.com/makebooks-ai/scraper => ../path/to/scraper

Import the package and stream results from Crawl. The API is intentionally small:

import (
  "context"
  "fmt"
  "net/url"
  "time"

  "github.com/makebooks-ai/scraper"
)

func Example() error {
  seed, err := url.Parse("https://example.com")
  if err != nil {
    return err
  }

  s := scraper.New(scraper.Config{
    Seed:          seed,
    DepthLimit:    1,
    SameHostOnly:  true,
    RespectRobots: true,
    NoFollow:      false,
    MaxPages:      0,           // unlimited
    RPS:           1.0,         // per-host
    Workers:       4,
    Timeout:       15 * time.Second,
    MaxRetries:    2,
    BackoffBase:   700 * time.Millisecond,
    UserAgent:     "scraper (+https://example.com)",
    // Optional:
    // Selectors: []string{"article h1", "article p"},
    // IncludeRe: regexp.MustCompile(`^https://example\.com/blog`),
    // ExcludeRe: regexp.MustCompile(`/page/`),
    // SaveHTMLDir: "./snapshots",
  })

  ch, err := s.Crawl(context.Background())
  if err != nil {
    return err
  }

  for rec := range ch {
    if rec.Error != "" {
      fmt.Println("ERR:", rec.URL, rec.Error)
      continue
    }
    fmt.Println(rec.Status, rec.Title, rec.URL)
  }
  return nil
}

scraper.Config

Field          Type             Meaning
Seed           *url.URL         Starting URL (required)
DepthLimit     int              Max link depth (0 = only seed)
SameHostOnly   bool             Only enqueue links on the seed host
RespectRobots  bool             Consult cached per-host robots.txt; drop disallowed paths
NoFollow       bool             Don’t enqueue links; just fetch/process current pages
MaxPages       int              Hard cap on unique pages (0 = unlimited)
RPS            float64          Requests per second, per host
Workers        int              Worker goroutines
Timeout        time.Duration    HTTP request timeout
MaxRetries     int              Retry count for 5xx/temporary errors
BackoffBase    time.Duration    Exponential backoff base delay
UserAgent      string           User-Agent header to send
Selectors      []string         If non-empty, collect text from these CSS selectors; otherwise visible body text
SaveHTMLDir    string           If set, write raw HTML snapshots (hashed filenames)
IncludeRe      *regexp.Regexp   Enqueue only matching links
ExcludeRe      *regexp.Regexp   Skip matching links

scraper.Result
type Result struct {
  URL            string
  Status         int
  FetchedAt      time.Time
  Title          string
  MetaDesc       string
  OGTitle        string
  OGDescription  string
  ContentExcerpt string
  Links          []string
  Error          string // non-empty when fetch/parse blocked or failed
}

CLI

A thin wrapper around the library lives in cli/main.go.

Build

go build -o scraper ./cli

Usage

./scraper -url https://example.com -depth 1 -samehost -workers 8 -rps 1.5 \
  -selectors "article h1,article p" -out pages.jsonl -save_html ./snapshots

Flags

Flag         Default                          Description
-url         (required)                       Seed URL
-depth       1                                Max depth (0 = only seed)
-samehost    false                            Only enqueue same-host links
-workers     4                                Worker goroutines
-rps         1.0                              Per-host requests/sec
-ua          scraper (+https://example.com)   User-Agent header
-timeout     15s                              HTTP timeout
-selectors   ""                               Comma-separated CSS selectors
-include     ""                               Regex; only enqueue links that match
-exclude     ""                               Regex; skip matching links
-out         stdout                           JSONL output path
-save_html   ""                               Directory for raw HTML snapshots
-retries     2                                Max retries for transient errors
-backoff     700ms                            Backoff base delay
-robots      true                             Respect robots.txt
-no_follow   false                            Do not enqueue new links
-max_pages   0                                Limit unique pages (0 = unlimited)

JSONL output schema

Each line is a single Result as JSON:

{"url":"https://example.com/","status":200,"fetched_at":"2025-01-01T12:00:00Z","title":"Home","meta_description":"...","og_title":"...","og_description":"...","content_excerpt":"...","links":["https://example.com/about"],"error":""}

Notes on politeness

  • Keep -rps and -workers modest.
  • Always respect a site’s Terms of Service and robots.txt.
  • Avoid scraping content behind logins or rate limits you don’t control.

Development

scraper/
├─ go.mod
├─ scraper.go     # importable library API
└─ cli/main.go    # CLI wiring to the library

Run tests:

go test ./... -v

Versioning

Follow semantic versioning. For breaking API changes after v1, use Go’s semantic import versioning (a /v2 suffix in the module path).
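Concretely, a v2 release would change both the module path and every import path (paths shown are illustrative):

```
// go.mod of the v2 module
module github.com/makebooks-ai/scraper/v2

// and in consumers:
import "github.com/makebooks-ai/scraper/v2"
```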


License

Apache 2.0. See LICENSE.

Documentation


Types

type Config

type Config struct {
	Seed          *url.URL
	DepthLimit    int
	SameHostOnly  bool
	RespectRobots bool
	NoFollow      bool
	MaxPages      int

	RPS         float64
	Workers     int
	Timeout     time.Duration
	MaxRetries  int
	BackoffBase time.Duration

	UserAgent   string
	Selectors   []string // if empty, falls back to visible text excerpt
	SaveHTMLDir string

	IncludeRe *regexp.Regexp
	ExcludeRe *regexp.Regexp
}

type Result

type Result struct {
	URL            string
	Status         int
	FetchedAt      time.Time
	Title          string
	MetaDesc       string
	OGTitle        string
	OGDescription  string
	ContentExcerpt string
	Links          []string
	Error          string
}

type Scraper

type Scraper struct {
	// contains filtered or unexported fields
}

func New

func New(cfg Config) *Scraper

func (*Scraper) Crawl

func (s *Scraper) Crawl(ctx context.Context) (<-chan Result, error)

Crawl streams page Results on a returned channel and closes it on completion. Callers can range over the channel until it closes.
