scrpr module v1.1.0 · Published: Feb 17, 2026 · License: MIT
scrpr

A fast CLI tool for extracting main content from websites. Supports multiple extraction backends (local readability, Tavily, Jina Reader), browser cookie integration, and UNIX pipe operations.

Features

  • Multiple extraction backends - local readability (default), Tavily Extract API, Jina Reader API
  • Clean content extraction using readability algorithms with intelligent newline cleaning
  • Pipe-friendly - full UNIX pipe support, pairs with sx for search-to-content pipelines
  • Multiple output formats - text or Markdown
  • Batch processing - process multiple URLs with progress, rate limiting, and error resilience
  • Directory output - save each URL to its own file with -o dir/
  • Browser cookie integration - extract cookies from Chrome, Firefox, Safari, Zen
  • Quiet mode - -q suppresses all non-content output for clean piping
  • Granular exit codes - 0=ok, 1=network, 2=parse, 3=input, 4=config, 5=io, 6=partial

Installation

git clone https://github.com/byteowlz/scrpr.git
cd scrpr
go build -o scrpr cmd/scrpr/main.go

Quick Start

# Extract content from a URL
scrpr https://example.com

# Output as Markdown
scrpr https://example.com --format markdown

# Use Jina Reader for JS-heavy sites (no API key needed)
scrpr https://example.com -B jina

# Pipe from sx search
sx "query" -L -n 5 | scrpr --format markdown

Extraction Backends

scrpr supports three extraction backends:

Backend                Flag            Auth      Best For
readability (default)  -B readability  None      Fast, local, most sites
tavily                 -B tavily       API key   JS-heavy sites, better quality
jina                   -B jina         Optional  Free, no auth needed, decent quality

Readability (Default)

Local extraction using go-readability. No API key needed. Works for most sites.

scrpr https://example.com

Tavily Extract API

Cloud-based extraction that handles JavaScript-heavy sites better. Requires an API key (free tier: 1,000 credits/month).

# Via config
# [extraction.tavily]
# api_key = "tvly-xxx"

# Via env var
export TAVILY_API_KEY="tvly-xxx"

scrpr https://js-heavy-site.com -B tavily --format markdown

Jina Reader API

Free extraction via r.jina.ai. Works without an API key (rate limited). Optional key for higher limits.

scrpr https://example.com -B jina --format markdown
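Jina Reader is a URL-prefix service: the target URL is appended to the r.jina.ai endpoint, and an optional key raises the rate limits. A minimal sketch of the URL construction (the prefix scheme and bearer-token header reflect common r.jina.ai usage; verify against the current Jina docs):

```shell
# Build the Jina Reader fetch URL by prefixing the target URL.
target="https://example.com"
reader_url="https://r.jina.ai/${target}"
echo "$reader_url"

# With an API key, the key is sent as a bearer token for higher limits:
#   curl -H "Authorization: Bearer $JINA_API_KEY" "$reader_url"
```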

Usage

Basic

# Single URL
scrpr https://example.com

# Multiple URLs
scrpr https://a.com https://b.com

# From file
scrpr -f urls.txt

# From pipe
echo "https://example.com" | scrpr
cat urls.txt | scrpr --format markdown

Output Options

# Save to file
scrpr https://example.com -o article.md --format markdown

# Save each URL to its own file in a directory
scrpr https://a.com https://b.com -o articles/

# Include metadata
scrpr https://example.com --include-metadata

Batch Processing

# Rate limiting
scrpr -f urls.txt --delay 0.5

# Continue on error
scrpr -f urls.txt --continue-on-error

# Progress indicator
scrpr -f urls.txt --progress

# Quiet mode (content only, no stderr)
scrpr -f urls.txt -q
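The -f flag reads a plain text file of URLs; one URL per line is an assumption consistent with the pipe examples above. A quick way to build such a file:

```shell
# Create a URL list for batch mode: one URL per line, no extra markup.
cat > urls.txt <<'EOF'
https://example.com
https://example.org
EOF

# Two lines, two URLs.
wc -l < urls.txt
```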

Pipelines with sx

# Search and extract content
sx "rust error handling" -L -n 5 | scrpr --format markdown

# Save to directory
sx "query" -L -n 5 | scrpr --format markdown -o articles/

# Use Jina for JS-heavy results
sx "query" -L -n 5 | scrpr -B jina --format markdown

# With rate limiting and error resilience
sx "query" -L -n 10 | scrpr --delay 0.5 --continue-on-error

# Quiet pipeline
sx "query" -L -n 5 | scrpr -q --format markdown > output.md

All Flags

Flags:
  -B, --extract-backend string   extraction backend (readability, tavily, jina)
  -f, --file string              read URLs from file
  -o, --output string            output to file or directory
      --format string            text or markdown (default "text")
      --separator string         separator for multiple URLs (default "---")
      --null-separator           null byte separator (for xargs -0)
  -c, --concurrency int          max concurrent requests (default 5)
      --batch-size int           process in batches of N
      --progress                 show progress for batch processing
  -b, --browser string           browser for cookies (chrome/firefox/safari/zen)
      --javascript               force JS rendering
      --no-js                    disable JS rendering
      --skip-banners             skip cookie banners (default true)
      --timeout int              request timeout in seconds (default 30)
      --include-metadata         include page metadata
      --user-agent string        custom user agent
      --browser-agent string     browser agent type
      --continue-on-error        continue on URL failures
      --no-follow-redirects      disable HTTP redirects
      --delay float              seconds between requests
  -v, --verbose                  verbose output
  -q, --quiet                    suppress non-content output
      --config string            config file path
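The --null-separator flag exists for safe downstream splitting: assuming it emits a NUL byte after each extracted document, tools like xargs -0 can split the stream correctly even when content contains newlines or the "---" separator string. A sketch of the receiving side, with printf standing in for scrpr's output:

```shell
# Simulate NUL-separated scrpr output and split it with xargs -0.
# Each NUL-terminated chunk becomes one argument, newlines included.
printf 'first doc\0second doc\0' | xargs -0 -n1 echo
```

In a real pipeline this would be `scrpr -f urls.txt --null-separator | xargs -0 -n1 process-doc`.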

Configuration

The config file lives at $XDG_CONFIG_HOME/scrpr/config.toml and is auto-created on first run.

[extraction]
backend = "readability"          # readability, tavily, jina
min_content_length = 100
remove_ads = true

[extraction.tavily]
api_key = ""                     # or set TAVILY_API_KEY env var
extract_depth = "basic"          # basic or advanced

[extraction.jina]
api_key = ""                     # optional, for higher rate limits

[output]
default_format = "text"
preserve_links = true

[network]
timeout = 30
browser_agent = "auto"
follow_redirects = true
delay = 0

[parallel]
max_concurrency = 5
show_progress = true
fail_fast = false
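When $XDG_CONFIG_HOME is unset, the XDG Base Directory spec falls back to ~/.config, so the effective config path can be resolved like this (a sketch of the XDG convention, not scrpr's own code):

```shell
# Resolve the default config path, falling back to ~/.config per the XDG spec.
config_path="${XDG_CONFIG_HOME:-$HOME/.config}/scrpr/config.toml"
echo "$config_path"
```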

Exit Codes

Code  Meaning
0     Success
1     Network error
2     Parse/extraction error
3     Invalid input
4     Config error
5     File I/O error
6     Partial success (some URLs failed)
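These granular codes make scrpr scriptable: a wrapper can tell a clean batch from a partial one. A sketch, with `status=6` standing in for `$?` after a real invocation:

```shell
# Branch on scrpr's documented exit codes; 6 means some URLs failed.
status=6   # stand-in for: scrpr -f urls.txt --continue-on-error; status=$?
case "$status" in
  0) echo "all URLs extracted" ;;
  6) echo "partial success: some URLs failed" ;;
  1) echo "network error" ;;
  *) echo "failed with exit code $status" ;;
esac
```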

License

MIT License
