scrpr
A fast CLI tool for extracting main content from websites. Supports multiple
extraction backends (local readability, Tavily, Jina Reader), browser cookie
integration, and UNIX pipe operations.
Features
- Multiple extraction backends - local readability (default), Tavily Extract API, Jina Reader API
- Clean content extraction using readability algorithms with intelligent newline cleaning
- Pipe-friendly - full UNIX pipe support, pairs with
sx for search-to-content pipelines
- Multiple output formats - text or Markdown
- Batch processing - process multiple URLs with progress, rate limiting, and error resilience
- Directory output - save each URL to its own file with
-o dir/
- Browser cookie integration - extract cookies from Chrome, Firefox, Safari, Zen
- Quiet mode -
-q suppresses all non-content output for clean piping
- Granular exit codes - 0=ok, 1=network, 2=parse, 3=input, 4=config, 5=io, 6=partial
Installation
git clone https://github.com/byteowlz/scrpr.git
cd scrpr
go build -o scrpr cmd/scrpr/main.go
Quick Start
# Extract content from a URL
scrpr https://example.com
# Output as Markdown
scrpr https://example.com --format markdown
# Use Jina Reader for JS-heavy sites (no API key needed)
scrpr https://example.com -B jina
# Pipe from sx search
sx "query" -L -n 5 | scrpr --format markdown
scrpr supports three extraction backends:
| Backend |
Flag |
Auth |
Best For |
| readability (default) |
-B readability |
None |
Fast, local, most sites |
| tavily |
-B tavily |
API key |
JS-heavy sites, better quality |
| jina |
-B jina |
Optional |
Free, no auth needed, decent quality |
Readability (Default)
Local extraction using go-readability. No API key needed. Works for most sites.
scrpr https://example.com
Cloud-based extraction that handles JavaScript-heavy sites better.
Requires an API key (free tier: 1,000 credits/month).
# Via config
# [extraction.tavily]
# api_key = "tvly-xxx"
# Via env var
export TAVILY_API_KEY="tvly-xxx"
scrpr https://js-heavy-site.com -B tavily --format markdown
Jina Reader API
Free extraction via r.jina.ai. Works without an API key (rate limited).
Optional key for higher limits.
scrpr https://example.com -B jina --format markdown
Usage
Basic
# Single URL
scrpr https://example.com
# Multiple URLs
scrpr https://a.com https://b.com
# From file
scrpr -f urls.txt
# From pipe
echo "https://example.com" | scrpr
cat urls.txt | scrpr --format markdown
Output Options
# Save to file
scrpr https://example.com -o article.md --format markdown
# Save each URL to its own file in a directory
scrpr https://a.com https://b.com -o articles/
# Include metadata
scrpr https://example.com --include-metadata
Batch Processing
# Rate limiting
scrpr -f urls.txt --delay 0.5
# Continue on error
scrpr -f urls.txt --continue-on-error
# Progress indicator
scrpr -f urls.txt --progress
# Quiet mode (content only, no stderr)
scrpr -f urls.txt -q
Pipelines with sx
# Search and extract content
sx "rust error handling" -L -n 5 | scrpr --format markdown
# Save to directory
sx "query" -L -n 5 | scrpr --format markdown -o articles/
# Use Jina for JS-heavy results
sx "query" -L -n 5 | scrpr -B jina --format markdown
# With rate limiting and error resilience
sx "query" -L -n 10 | scrpr --delay 0.5 --continue-on-error
# Quiet pipeline
sx "query" -L -n 5 | scrpr -q --format markdown > output.md
All Flags
Flags:
-B, --extract-backend string extraction backend (readability, tavily, jina)
-f, --file string read URLs from file
-o, --output string output to file or directory
--format string text or markdown (default "text")
--separator string separator for multiple URLs (default "---")
--null-separator null byte separator (for xargs -0)
-c, --concurrency int max concurrent requests (default 5)
--batch-size int process in batches of N
--progress show progress for batch processing
-b, --browser string browser for cookies (chrome/firefox/safari/zen)
--javascript force JS rendering
--no-js disable JS rendering
--skip-banners skip cookie banners (default true)
--timeout int request timeout in seconds (default 30)
--include-metadata include page metadata
--user-agent string custom user agent
--browser-agent string browser agent type
--continue-on-error continue on URL failures
--no-follow-redirects disable HTTP redirects
--delay float seconds between requests
-v, --verbose verbose output
-q, --quiet suppress non-content output
--config string config file path
Configuration
Config at $XDG_CONFIG_HOME/scrpr/config.toml (auto-created on first run).
[extraction]
backend = "readability" # readability, tavily, jina
min_content_length = 100
remove_ads = true
[extraction.tavily]
api_key = "" # or set TAVILY_API_KEY env var
extract_depth = "basic" # basic or advanced
[extraction.jina]
api_key = "" # optional, for higher rate limits
[output]
default_format = "text"
preserve_links = true
[network]
timeout = 30
browser_agent = "auto"
follow_redirects = true
delay = 0
[parallel]
max_concurrency = 5
show_progress = true
fail_fast = false
Exit Codes
| Code |
Meaning |
| 0 |
Success |
| 1 |
Network error |
| 2 |
Parse/extraction error |
| 3 |
Invalid input |
| 4 |
Config error |
| 5 |
File I/O error |
| 6 |
Partial success (some URLs failed) |
License
MIT License