Published: Mar 23, 2026 License: Apache-2.0


Spartan Scraper

Spartan Scraper is a local-first scraping workbench for turning a URL into a clean result, a bounded crawl, or a research job without standing up cloud infrastructure.

It is built for people who want one dependable workflow from fetch to stored artifacts: open the UI or CLI, submit work, inspect results locally, promote a verified job into reusable automation, and reopen saved watch/export history when daily operations or failure recovery matter.

A healthy first run works without any AI, proxy-pool, or retention setup. Those are optional subsystems you can enable later when a workflow actually calls for them.

If you want the fastest path in, start with the 5-minute demo below. If you are integrating it into a real workflow, the API, MCP server, schedules, and local artifact model all build on that same core path.

Planning and future work live in docs/roadmap.md. That document is the canonical source of truth for what is in flight, next, and explicitly out of scope for the current cutover.

Why It Exists

  • Start from a URL and get something useful quickly: extracted content, crawl output, or a research bundle.
  • Keep everything local by default: jobs, artifacts, auth profiles, schedules, and render rules stay on disk.
  • Use the same job model everywhere: Web UI, CLI, API, TUI, and MCP all operate on the same persisted workflows.
  • Stay practical for real sites: HTTP-first by default, Chromedp/Playwright when pages are JS-heavy or need login flows.

5-Minute Demo

This walkthrough uses the default out-of-the-box path. Leave AI, proxy pooling, and retention off; they are optional and not needed for the first successful scrape.

git clone <repo-url>
cd spartan-scraper

make verify-toolchain
make install
make generate
make build

# terminal 1
./bin/spartan server

# terminal 2
make web-dev

If startup detects a legacy or unsupported .data directory, Spartan now serves a guided setup mode instead of failing with a terminal-only error. Run ./bin/spartan reset-data once to archive the current data directory under output/cutover/ and recreate .data for a fresh Balanced 1.0 boot.

Open http://localhost:5173, submit a scrape job for https://example.com, and expect:

  • the dashboard to show a new job move into succeeded
  • the results panel to include Example Domain
  • the metrics widget to show a live WebSocket connection

If you want a CLI-only proof first:

./bin/spartan scrape --url https://example.com --out ./out/example.json
cat ./out/example.json

Expected output includes the page title text Example Domain.
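For scripted checks, the saved JSON can be validated without eyeballing it. A minimal Go sketch, assuming the result document exposes a top-level `title` field (the exact output schema is defined by the scraper, so treat that field name as an assumption):

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// verifyTitle decodes a scrape result and checks that its title
// contains the expected text. The "title" field name is an assumed
// shape for ./out/example.json, not a documented contract.
func verifyTitle(raw []byte, want string) (bool, error) {
	var doc struct {
		Title string `json:"title"`
	}
	if err := json.Unmarshal(raw, &doc); err != nil {
		return false, err
	}
	return strings.Contains(doc.Title, want), nil
}

func main() {
	sample := []byte(`{"title":"Example Domain","url":"https://example.com"}`)
	ok, err := verifyTitle(sample, "Example Domain")
	fmt.Println(ok, err)
}
```

Swap the inline sample for `os.ReadFile("./out/example.json")` to check a real run.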

For a more guided version with expected checkpoints, see docs/demo.md.

Validated Operator Workflow

The current operator path is:

  1. submit a scrape, crawl, or research job
  2. inspect the saved result on the job detail route
  3. promote that verified job into a template, watch, or export schedule draft
  4. save the automation from its destination surface
  5. inspect persisted watch checks and export outcomes from /automation/watches and /automation/exports
  6. follow guided recovery actions when a watch check or export run fails

The README keeps the fast first-success path near the top. For the guided continuation into promotion, saved automation, history inspection, and failure recovery, follow the numbered steps above.

What It Covers

  • Single pages, full websites, and deep research workflows.
  • Works for static HTML and JS‑heavy sites (headless Chromium or Playwright).
  • Unified interfaces: CLI, TUI, and Web UI.
  • Clean API contract (OpenAPI) with generated TS client.
  • Local, self‑hosted, no SaaS dependencies.
  • Webhook integrations that now distinguish JSON job events from multipart export deliveries, so downstream receivers get actual exported bytes on export.completed instead of placeholder path metadata.

Project Status

Spartan Scraper is in 1.0.0-rc1 release prep. The retained 1.0 core is now intentionally narrow and validated locally through make ci-pr and make ci-slow.

Quickstart

# Quick install (CLI-focused; requires Go 1.26+)
go install github.com/fitchmultz/spartan-scraper/cmd/spartan@latest

# Full local setup (recommended for contributors and operators)
git clone <repo-url>
cd spartan-scraper
make verify-toolchain
make install
make generate
make build
./bin/spartan --help
./bin/spartan server
make web-dev

If the default .data directory came from a pre-cutover build, reset it with:

./bin/spartan reset-data

Open http://localhost:5173, submit a scrape for https://example.com, and confirm the saved result contains Example Domain.

For a full local validation pass, run:

make ci-pr

Optional: install the binary into ~/.local/bin (or $XDG_BIN_HOME) with:

make install-bin

Developer And Agent Workflows

  • Agents get an MCP surface, a deterministic local API, and a persistent job store they can inspect and reuse.
  • Developers get one local system for UI, CLI, and API validation instead of separate throwaway scripts.
  • Saved results can be re-authored locally too: bounded AI helpers can now refine research output, shape recurring exports, and generate validated JMESPath/JSONata transforms from representative persisted artifacts without launching new jobs.
  • Teams get reproducible CI, generated API types, and stored artifacts that make behavior easier to verify.

Community

Documentation

CLI examples

# Single page scrape (HTTP)
./bin/spartan scrape \
  --url https://example.com \
  --out ./out/example.json

# Headless scrape with login (form-based)
./bin/spartan scrape \
  --url https://example.com/dashboard \
  --headless \
  --playwright \
  --login-url https://example.com/login \
  --login-user-selector '#email' \
  --login-pass-selector '#password' \
  --login-submit-selector 'button[type=submit]' \
  --login-user you@example.com \
  --login-pass 'demo-password' \
  --out ./out/dashboard.json

# Crawl a site (depth-limited)
./bin/spartan crawl \
  --url https://example.com \
  --max-depth 2 \
  --max-pages 200 \
  --out ./out/site.jsonl

# Deep research
./bin/spartan research \
  --query "pricing model" \
  --urls https://example.com,https://example.com/docs \
  --out ./out/research.jsonl

# If you later load a proxy pool, prefer residential us-east proxies from it for one request
./bin/spartan scrape \
  --url https://example.com \
  --proxy-region us-east \
  --proxy-tag residential \
  --out ./out/example.json

# MCP server (stdio)
./bin/spartan mcp

# Auth profiles
./bin/spartan auth set --name acme --auth-basic user:pass --header "X-API-Key: token-from-provider"
./bin/spartan scrape --url https://example.com --auth-profile acme

# Extraction Templates
./bin/spartan scrape --url https://example.com/product/123 --extract-template product
./bin/spartan scrape --url https://example.com --extract-config my-template.json

# Export results
./bin/spartan export --job-id <id> --format md --out ./out/report.md
./bin/spartan export --job-id <id> --schedule-id <export-schedule-id>
./bin/spartan export --inspect-id <export-id>
./bin/spartan export --history-job-id <job-id>
./bin/spartan export-schedule history --id <export-schedule-id>

# Watch inspection
./bin/spartan watch check <watch-id>
./bin/spartan watch history <watch-id>
./bin/spartan watch history <watch-id> --check-id <check-id>

# Schedules
./bin/spartan schedule add --kind scrape --interval 3600 --url https://example.com
./bin/spartan schedule list

# Run API server + background worker (API binds to localhost by default)
./bin/spartan server

# Archive and recreate a legacy/default .data directory for Balanced 1.0
./bin/spartan reset-data

# Expose API on all interfaces (use with caution)
# API key auth is auto-enforced when BIND_ADDR is non-localhost.
BIND_ADDR=0.0.0.0 ./bin/spartan server

# Launch TUI
./bin/spartan tui

Web UI
./bin/spartan server
# in a second terminal:
make web-dev

Open http://localhost:5173 for the UI.

Validated operator continuation after a successful job:

  • /jobs/:id is the promotion starting point for templates, watches, and export schedules
  • /templates is where promoted templates are saved and previewed
  • /automation/watches is where promoted watches are saved, checked, and inspected through persisted history
  • /automation/exports is where promoted export schedules are saved and where export outcome history stays inspectable across success and failure states

The dev server proxies API requests to http://localhost:8741 by default. WebSocket upgrades to /v1/ws accept browser origins from loopback hosts only (localhost, 127.0.0.1, ::1). Non-browser clients without an Origin header remain supported. If you run the backend on a different local port, set DEV_API_PROXY_TARGET=http://127.0.0.1:<port> in web/.env so the dev proxy stays same-origin. Use VITE_API_BASE_URL only for deployed cross-origin builds where the browser should call a remote API directly.
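The loopback-only origin policy above can be sketched as a small predicate. This is an illustrative reimplementation of the stated rule, not the server's actual code; an empty origin is accepted to match the non-browser-client behavior:

```go
package main

import (
	"fmt"
	"net/url"
)

// originAllowed mirrors the documented WebSocket policy: requests
// without an Origin header pass (non-browser clients), otherwise
// only loopback hosts are accepted. Illustrative sketch only.
func originAllowed(origin string) bool {
	if origin == "" {
		return true
	}
	u, err := url.Parse(origin)
	if err != nil {
		return false
	}
	// Hostname strips the port and any IPv6 brackets.
	switch u.Hostname() {
	case "localhost", "127.0.0.1", "::1":
		return true
	}
	return false
}

func main() {
	fmt.Println(originAllowed("http://localhost:5173"))
	fmt.Println(originAllowed("https://evil.example"))
}
```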

Repo-local AI defaults live in .env and config/pi-routes.json, but core scrape/crawl/research workflows work without AI. Leave PI_ENABLED=false for the default first run; when you later set PI_ENABLED=true, Spartan asks pi for routes in this order: kimi-coding/k2p5, zai/glm-5, openai-codex/gpt-5.4. Auth, account selection, and billing stay in pi; if you want a different route order or different provider/model IDs, override PI_CONFIG_PATH or edit that routes file locally.

Proxy pooling and retention are optional too: leave PROXY_POOL_FILE unset and RETENTION_ENABLED=false until you actually need pooled routing or automated cleanup.
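Put together, a first-run .env that keeps the optional subsystems off looks roughly like this (only keys documented in this README are shown; treat your checkout's .env as authoritative for everything else):

```shell
# Optional subsystems stay off for the default first run
PI_ENABLED=false
RETENTION_ENABLED=false
# PROXY_POOL_FILE stays unset until pooled routing is needed

# Robots compliance is opt-in (ignored by default)
RESPECT_ROBOTS_TXT=false
```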

Interfaces

  • Web UI for job submission, monitoring, automation, and settings
  • CLI for scripting and local automation
  • REST + WebSocket APIs for integrations
  • MCP server for agent orchestration
  • TUI for terminal-first inspection

Architecture at a glance

  • cmd/spartan: main entrypoint for CLI/TUI/API server.
  • internal/fetch: HTTP + headless fetchers (Chromedp + Playwright).
  • internal/extract: HTML parsing + text/metadata extraction.
  • internal/crawl: BFS crawler with depth/limit and domain scoping (robots ignored by default; opt-in support available).
  • internal/jobs: persistent job store + queue runner.
  • internal/api: HTTP API + OpenAPI contract.
  • web: Vite + React UI; generated API client under web/src/api.
  • internal/research: multi-source workflow (scrape/crawl → evidence → simhash dedup → clustering → citations + confidence → summary).
  • internal/mcp: MCP stdio server for agent orchestration.
  • internal/exporter: exports results to json/jsonl/csv/markdown/xlsx.
  • internal/scheduler: recurring job runner with interval schedules.
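The simhash step in the research pipeline above reduces each document to a 64-bit fingerprint whose Hamming distance approximates content similarity. A minimal sketch of the general technique (the real internal/simhash package may tokenize and weight features differently):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/bits"
	"strings"
)

// simhash folds per-token 64-bit hashes into a single fingerprint:
// each bit position accumulates +1/-1 votes across all tokens, and
// positions with a positive total become set bits.
func simhash(text string) uint64 {
	var votes [64]int
	for _, tok := range strings.Fields(strings.ToLower(text)) {
		h := fnv.New64a()
		h.Write([]byte(tok))
		sum := h.Sum64()
		for i := 0; i < 64; i++ {
			if sum&(1<<uint(i)) != 0 {
				votes[i]++
			} else {
				votes[i]--
			}
		}
	}
	var fp uint64
	for i, v := range votes {
		if v > 0 {
			fp |= 1 << uint(i)
		}
	}
	return fp
}

// hammingDistance counts differing bits between two fingerprints;
// near-duplicate texts tend to land within a small distance.
func hammingDistance(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

func main() {
	near := hammingDistance(simhash("spartan scraper workbench"), simhash("spartan scraper toolbench"))
	far := hammingDistance(simhash("spartan scraper workbench"), simhash("unrelated words entirely"))
	fmt.Println(near, far) // near is typically much smaller than far
}
```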

Notes

  • Robots.txt is ignored by default; enable compliance with --respect-robots or RESPECT_ROBOTS_TXT=true.
  • Auth support: headers, cookies, basic auth, tokens, query params, and form login (headless).
  • Auth vault is stored in .data/auth_vault.json (profiles + presets + inheritance).
  • Render profiles (adaptive rules) are stored in .data/render_profiles.json.
  • Rate limiting + retries are configurable via .env.
  • Playwright can be enabled with USE_PLAYWRIGHT=1 or --playwright (CLI/API). Install browsers with make install-playwright. make ci-slow provisions Playwright automatically for clean-machine heavy runs.

Toolchain

Pinned in .tool-versions:

  • Go 1.26.1
  • Node 25.8.1
  • pnpm 10.32.1

Use a .tool-versions-compatible version manager (for example mise install) to provision those exact versions, then run make verify-toolchain before build/test work.

Local CI

GitHub workflow split:

  • PR required: .github/workflows/ci-pr.yml (make ci-pr)
  • Nightly/manual heavy checks: .github/workflows/ci-slow.yml (make ci-slow, deterministic local-fixture heavy lane that provisions Playwright browsers)

make verify-toolchain  # Print and enforce the exact Go/Node/pnpm contract from .tool-versions
make audit-public      # Scan tracked files + branch history for public-readiness leaks/secrets/placeholders
make secret-scan       # Deep git-history secret scan (manual/nightly release-tier check)
make ci-pr             # PR-equivalent deterministic gate (requires clean git state)
make ci                # Full local gate (Go + web + pi-bridge install/build/tests)
make ci-slow           # Deterministic heavy stress/e2e checks (local fixture; provisions Playwright)
make ci-network        # Optional live-Internet smoke validation
CI_VITEST_MAX_WORKERS=2 make ci-pr  # Optional local worker cap override
make ci-manual         # Manual full heavy sweep (ci-slow + ci-network)

Directories

Path Synopsis
cmd
spartan command
Package main provides the command-line entry point for Spartan Scraper.
internal
ai
Package ai manages the bridge process used for pi-backed LLM operations.
aiauthoring
Package aiauthoring implements bounded AI-assisted authoring for automation artifacts.
analytics
Package analytics provides historical metrics collection and aggregation.
api
Package api provides HTTP handlers for bounded AI authoring endpoints.
apperrors
Package apperrors provides classified error handling infrastructure.
artifacts
Package artifacts manages canonical job artifact metadata.
auth
Package auth provides authentication profile management and credential resolution.
buildinfo
Package buildinfo provides information about the current build of the application.
captcha
Package captcha provides CAPTCHA detection and solving service integration.
cli
Package cli provides the Spartan Scraper command-line interface router.
cli/ai
Package ai implements the Spartan CLI subcommands for bounded AI authoring workflows.
cli/batch
Package batch provides CLI commands for batch job operations.
cli/common
Package common provides shared CLI helpers used across command modules.
cli/manage
Package manage contains CLI commands for configuration/data management (auth/export/templates/states/jobs/schedule).
cli/scrape
Package scrape contains crawl CLI command wiring.
cli/server
Package server contains health CLI command wiring.
config
Package config provides application configuration loading from environment variables.
crawl
Package crawl provides URL pattern matching for crawl filtering.
dedup
Package dedup provides cross-job content deduplication using simhash.
diff
Package diff provides content diffing functionality for change detection.
exporter
Package exporter provides CSV export implementation.
extract
Package extract provides caching for AI extraction results to reduce API costs.
fetch
Package fetch provides HTTP and headless browser content fetching capabilities.
fsutil
Package fsutil provides filesystem utilities for secure data directory management.
hostmatch
Package hostmatch provides centralized host extraction and pattern matching utilities.
jobs
Package jobs provides job creation and persistence logic for scrape, crawl, and research jobs.
mcp
Package mcp exposes structured runtime diagnostics over the MCP tool surface.
model
Package model defines shared domain types for batch job operations.
paramdecode
Package paramdecode centralizes typed reads from persisted parameter maps.
pipeline
Package pipeline provides a plugin system for extending scrape and crawl workflows.
queue
Package queue provides pluggable queue backends for job distribution.
research
Package research provides citation URL normalization and generation.
retention
Package retention provides data retention policy enforcement.
runtime
Purpose: Initialize the fully wired runtime job manager used by local execution surfaces.
scheduler
Package scheduler provides an in-memory cache for schedules with file watching.
scrape
Package scrape provides functionality for scraping a single web page.
simhash
Package simhash provides content similarity detection using simhash algorithm.
store
Package store provides SQLite-backed persistent storage for analytics data.
submission
Package submission validates operator-facing batch requests and converts them into canonical create-time jobs.JobSpec values.
testsite
Package testsite provides a deterministic local HTTP fixture for end-to-end and stress tests.
ui/tui
Package tui provides entry points for running the TUI.
validate
Package validate provides request validators for scrape, crawl, and research operations.
watch
Package watch manages persisted watch-check artifacts.
webassets
Package webassets contains repository-level checks for the static web shell.
webhook
Package webhook resolves and pins webhook delivery targets.
Command serve_testsite runs the deterministic local fixture server used by stress and e2e validation.
