# scraper

A polite, concurrent web scraper you can use as a Go library or as a CLI.
- Per-host rate limiting (RPS)
- robots.txt compliance (cached per host)
- Depth-limited crawling; same-host restriction
- Include/exclude regexes to filter enqueued links
- Retries with exponential backoff
- Extracts `<title>`, meta/OG tags, and either visible text or text from custom CSS selectors
- Streams JSON Lines (one page per line) from the CLI
- Optional raw HTML snapshots to disk
## Install

Choose a module path (examples use `github.com/makebooks-ai/scraper`).

```sh
go get github.com/makebooks-ai/scraper@latest
```

Local development from another repo? Add a `replace` directive in the consumer's go.mod:

```
replace github.com/makebooks-ai/scraper => ../path/to/scraper
```
## Library (recommended)

Import the package and stream results from `Crawl`. The API is intentionally small:

```go
import (
	"context"
	"fmt"
	"net/url"
	"time"

	"github.com/makebooks-ai/scraper"
)

func Example() error {
	seed, _ := url.Parse("https://example.com")
	s := scraper.New(scraper.Config{
		Seed:          seed,
		DepthLimit:    1,
		SameHostOnly:  true,
		RespectRobots: true,
		NoFollow:      false,
		MaxPages:      0,   // unlimited
		RPS:           1.0, // per-host
		Workers:       4,
		Timeout:       15 * time.Second,
		MaxRetries:    2,
		BackoffBase:   700 * time.Millisecond,
		UserAgent:     "scraper (+https://example.com)",
		// Optional:
		// Selectors:   []string{"article h1", "article p"},
		// IncludeRe:   regexp.MustCompile(`^https://example\.com/blog`),
		// ExcludeRe:   regexp.MustCompile(`/page/`),
		// SaveHTMLDir: "./snapshots",
	})
	ch, err := s.Crawl(context.Background())
	if err != nil {
		return err
	}
	for rec := range ch {
		if rec.Error != "" {
			fmt.Println("ERR:", rec.URL, rec.Error)
			continue
		}
		fmt.Println(rec.Status, rec.Title, rec.URL)
	}
	return nil
}
```
### scraper.Config

| Field | Type | Meaning |
|---|---|---|
| `Seed` | `*url.URL` | Starting URL (required) |
| `DepthLimit` | `int` | Max link depth (0 = only seed) |
| `SameHostOnly` | `bool` | Only enqueue links on the seed host |
| `RespectRobots` | `bool` | Consult robots.txt (cached per host); drop disallowed paths |
| `NoFollow` | `bool` | Don't enqueue links; just fetch/process current pages |
| `MaxPages` | `int` | Hard cap on unique pages (0 = unlimited) |
| `RPS` | `float64` | Requests per second per host |
| `Workers` | `int` | Worker goroutines |
| `Timeout` | `time.Duration` | HTTP request timeout |
| `MaxRetries` | `int` | Retry count for 5xx/temporary errors |
| `BackoffBase` | `time.Duration` | Exponential backoff base |
| `UserAgent` | `string` | User-Agent header to send |
| `Selectors` | `[]string` | If non-empty, collect text from these CSS selectors; otherwise visible body text |
| `SaveHTMLDir` | `string` | If set, write raw HTML snapshots (hashed filenames) |
| `IncludeRe` | `*regexp.Regexp` | Enqueue only matching links |
| `ExcludeRe` | `*regexp.Regexp` | Skip matching links |
### scraper.Result

```go
type Result struct {
	URL            string
	Status         int
	FetchedAt      time.Time
	Title          string
	MetaDesc       string
	OGTitle        string
	OGDescription  string
	ContentExcerpt string
	Links          []string
	Error          string // non-empty when fetch/parse was blocked or failed
}
```
## CLI

A thin wrapper around the library lives in `cli/main.go`.

### Build

```sh
go build -o scraper ./cli/main.go
```

### Usage

```sh
./scraper -url https://example.com -depth 1 -samehost -workers 8 -rps 1.5 \
  -selectors "article h1,article p" -out pages.jsonl -save_html ./snapshots
```
### Flags

| Flag | Default | Description |
|---|---|---|
| `-url` | (required) | Seed URL |
| `-depth` | `1` | Max depth (0 = only seed) |
| `-samehost` | `false` | Only enqueue same-host links |
| `-workers` | `4` | Worker goroutines |
| `-rps` | `1.0` | Per-host requests/sec |
| `-ua` | `scraper (+https://example.com)` | User-Agent |
| `-timeout` | `15s` | HTTP timeout |
| `-selectors` | `""` | Comma-separated CSS selectors |
| `-include` | `""` | Regex; only enqueue links that match |
| `-exclude` | `""` | Regex; skip matching links |
| `-out` | stdout | JSONL output path |
| `-save_html` | `""` | Folder to dump raw HTML |
| `-retries` | `2` | Max retries for transient errors |
| `-backoff` | `700ms` | Backoff base |
| `-robots` | `true` | Respect robots.txt |
| `-no_follow` | `false` | Do not enqueue new links |
| `-max_pages` | `0` | Limit unique pages (0 = unlimited) |
## JSONL output schema

Each line is a single `Result` serialized as JSON:

```json
{"url":"https://example.com/","status":200,"fetched_at":"2025-01-01T12:00:00Z","title":"Home","meta_description":"...","og_title":"...","og_description":"...","content_excerpt":"...","links":["https://example.com/about"],"error":""}
```
## Notes on politeness

- Keep `-rps` and `-workers` modest.
- Always respect a site's Terms of Service and robots.txt.
- Avoid scraping content behind logins or rate limits you don't control.
## Development

```
scraper/
├─ go.mod
├─ scraper.go    # importable library API
└─ cli/main.go   # CLI wiring to the library
```

Run tests:

```sh
go test ./... -v
```
## Versioning

Follow semantic versioning. For breaking API changes after v1, use Go's semantic import versioning (a `/v2` suffix in the module path).
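For example, a hypothetical v2 of this module would declare the suffix in its go.mod, and consumers would import the `/v2` path:

```
module github.com/makebooks-ai/scraper/v2
```

Both major versions can then coexist in a single build, since Go treats the suffixed path as a distinct module.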
## License

Apache 2.0. See `LICENSE`.