CrawlObserver
Free, open-source SEO crawler built by SEObserver.
Extract 45+ SEO signals per page. Store in ClickHouse. Analyze at scale.
Quick Start · Web UI · CLI · Config · API · Contributing
Why CrawlObserver?
At SEObserver, we crawl billions of pages. We built CrawlObserver because every SEO deserves a proper crawler — not a spreadsheet with 10,000 rows, not a SaaS with monthly limits. A real tool that runs on your machine, stores data in a columnar database, and lets you query millions of pages in milliseconds.
We're giving it to the community for free. Use it, break it, improve it.
What it does
- Crawls websites following internal links from seed URLs
- Extracts 45+ SEO signals per page (title, canonical, meta tags, headings, hreflang, Open Graph, schema.org, images, links, indexability...)
- Respects robots.txt and per-host crawl delays
- Tracks redirect chains, response times, and body sizes
- Stores everything in ClickHouse (fast columnar queries over millions of pages)
- Computes PageRank and crawl depth per session
- Comes with a web UI, a REST API, and a native desktop app
Quick Start
Prerequisites: Go 1.25+ and Docker.
```bash
# 1. Clone & build
git clone https://github.com/SEObserver/crawlobserver.git
cd crawlobserver
make build

# 2. Start ClickHouse
docker compose up -d

# 3. Create tables
./crawlobserver migrate

# 4. Crawl a site
./crawlobserver crawl --seed https://example.com --max-pages 1000

# 5. Browse results
./crawlobserver serve
# Open http://127.0.0.1:8899
```
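If you prefer staying in the terminal, the same data is reachable from the CLI. A quick sketch using the commands documented below (the session ID is a placeholder):

```bash
# List crawl sessions and note the session ID
./crawlobserver sessions

# Export a session's external links to CSV
./crawlobserver report external-links --session <session-id> --format csv > external-links.csv
```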
Managed mode: Don't have Docker? CrawlObserver can download and run ClickHouse for you automatically. Set clickhouse.mode: managed in your config. Supported on macOS (Intel & Apple Silicon) and Linux (x86_64 & ARM64). On Windows, use Docker or provide your own ClickHouse binary via clickhouse.binary_path.
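For machines without Docker, the managed-mode flow could look like the sketch below. The environment variable name is an assumption derived from the documented CRAWLOBSERVER_ prefix pattern (clickhouse.mode → CRAWLOBSERVER_CLICKHOUSE_MODE); setting clickhouse.mode: managed in config.yaml works just as well.

```bash
# Optionally pre-download the ClickHouse binary (see the CLI reference)
./crawlobserver install-clickhouse

# Crawl with managed ClickHouse; the env var name assumes the documented
# CRAWLOBSERVER_ prefix mapping of clickhouse.mode
CRAWLOBSERVER_CLICKHOUSE_MODE=managed ./crawlobserver crawl --seed https://example.com
```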
Web UI
Start the web interface with ./crawlobserver serve and open http://127.0.0.1:8899.
The UI gives you:
- Session management — start, stop, resume, delete crawl sessions
- Page explorer — filter and browse crawled pages by status code, title, depth, word count...
- Tabs — overview, titles, meta, headings, images, indexability, response codes, internal links, external links
- PageRank — distribution histogram, treemap by path, top-N pages
- robots.txt tester — view robots.txt per host and test URL access
- Sitemap viewer — discover and browse sitemap trees
- Real-time progress — live crawl stats via Server-Sent Events
- Theming — custom accent color, logo, dark mode
- API key management — project-scoped keys for programmatic access
The UI is a single Go binary — no Node.js runtime needed in production.
CLI Reference
```
crawlobserver [command]
```
| Command | Description |
|---------|-------------|
| crawl | Start a crawl session |
| serve | Start the web UI |
| gui | Start the native desktop app (macOS) |
| migrate | Create or update ClickHouse tables |
| sessions | List all crawl sessions |
| report external-links | Export external links (table or CSV) |
| update | Check for updates and self-update |
| install-clickhouse | Download ClickHouse binary for offline use |
| version | Print version |
Crawl examples
```bash
# Single seed URL
crawlobserver crawl --seed https://example.com

# Multiple seeds from file (one URL per line)
crawlobserver crawl --seeds-file urls.txt

# Fine-tune the crawl
crawlobserver crawl --seed https://example.com \
  --workers 20 \
  --delay 500ms \
  --max-pages 50000 \
  --max-depth 10 \
  --store-html
```
Reports
```bash
# External links as a table
crawlobserver report external-links --format table

# Export to CSV
crawlobserver report external-links --format csv > external-links.csv

# Filter by session
crawlobserver report external-links --session <session-id> --format csv
```
Configuration
Copy config.example.yaml to config.yaml:
```bash
cp config.example.yaml config.yaml
```
All settings can be overridden via environment variables with the CRAWLOBSERVER_ prefix (e.g. CRAWLOBSERVER_CRAWLER_WORKERS=20) or via CLI flags.
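With the Cobra + Viper stack used by the CLI, flags typically override environment variables, which override the file. A sketch of the three levels for the worker count:

```bash
# config.yaml sets crawler.workers: 10 (file-level default)

# Environment variable overrides the file
CRAWLOBSERVER_CRAWLER_WORKERS=20 ./crawlobserver crawl --seed https://example.com

# CLI flag overrides both (typical Viper precedence)
./crawlobserver crawl --seed https://example.com --workers 30
```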
Key settings
| Setting | Default | Description |
|---------|---------|-------------|
| crawler.workers | 10 | Concurrent fetch workers |
| crawler.delay | 1s | Per-host request delay |
| crawler.max_pages | 0 | Max pages to crawl (0 = unlimited) |
| crawler.max_depth | 0 | Max crawl depth (0 = unlimited) |
| crawler.timeout | 30s | HTTP request timeout |
| crawler.user_agent | CrawlObserver/1.0 | User-Agent string |
| crawler.respect_robots | true | Obey robots.txt |
| crawler.store_html | false | Store raw HTML (ZSTD compressed) |
| crawler.crawl_scope | host | host (exact) or domain (eTLD+1) |
| clickhouse.host | localhost | ClickHouse host |
| clickhouse.port | 19000 | ClickHouse native protocol port |
| clickhouse.mode | (auto) | managed, external, or auto-detect |
| server.port | 8899 | Web UI port |
| server.username | admin | Basic auth username |
| server.password | (generated) | Basic auth password (random if not set) |
| resources.max_memory_mb | 0 | Memory soft limit (0 = auto) |
| resources.max_cpu | 0 | CPU limit / GOMAXPROCS (0 = all) |
See config.example.yaml for the full reference.
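As a starting point, a minimal config.yaml could look like the sketch below. The nesting is inferred from the dotted key names in the table; verify the exact structure against config.example.yaml.

```bash
# Write a minimal config.yaml (structure inferred from the dotted keys above)
cat > config.yaml <<'EOF'
crawler:
  workers: 20
  delay: 500ms
  max_pages: 50000
  store_html: false
clickhouse:
  mode: managed
server:
  port: 8899
EOF
```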
Architecture
```
Seed URLs
    |
    v
Frontier (priority queue, per-host delay, dedup)
    |
    v
Fetch Workers (N goroutines, robots.txt cache, redirect tracking)
    |
    v
Parser (goquery: 45+ SEO signals extracted)
    |
    v
Storage Buffer (batch insert, configurable flush)
    |
    v
ClickHouse (columnar storage, partitioned by month)
    |
    |---> Web UI (Svelte 5, embedded in binary)
    |---> REST API (40+ endpoints)
    |---> CLI reports
```
Why ClickHouse (and not SQLite, DuckDB, or a graph database)
A crawler has a hard architectural constraint: it writes and reads at the same time, continuously. During a crawl, 10+ goroutines batch-insert pages, links, and resources while the web UI polls for live progress every 100ms and users run analytical queries (filtered pages, audit aggregations, PageRank percentiles) on data that's still being written. This is a concurrent read/write workload on an analytical dataset — the worst case for embedded databases.
SQLite and DuckDB are single-writer. Under concurrent load, readers block writers or vice versa. You'd need to serialize access behind a mutex, which kills real-time monitoring — or accept that the UI freezes during crawls. ClickHouse is a client/server database: readers and writers never block each other, every goroutine gets its own connection from the pool, and the UI stays live throughout the crawl.
The trick is the managed mode: CrawlObserver downloads a ClickHouse static binary and runs it as a subprocess. The user sees one program; under the hood, there's a full analytical database server with concurrent access. Single-binary distribution, server-grade architecture.
The other benefits follow from this choice:
- DROP PARTITION deletes a 10M-page session instantly (O(1) metadata operation, not a table scan)
- joinGet() + Join engine computes PageRank server-side without round-tripping millions of URL strings to Go
- Columnar compression stores crawl data at ~10:1 ratios; ZSTD(3) on raw HTML bodies
- Built-in analytical functions (countIf(), quantile(), domain(), arrayJoin()) replace what would be hundreds of lines of post-processing in Go (see the query sketch below)
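To make the last point concrete, here is the kind of query this design enables, run with clickhouse-client against the default native port. The table and column names (pages, session_id, status_code, word_count) are illustrative only and may not match the actual schema:

```bash
# Hypothetical query: error ratio and median word count per crawl session.
# Table and column names are illustrative; check the real schema first.
clickhouse-client --port 19000 --query "
  SELECT
    session_id,
    countIf(status_code != 200) / count() AS error_ratio,
    quantile(0.5)(word_count)             AS median_words
  FROM pages
  GROUP BY session_id
"
```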
We sometimes get asked about graph databases (Neo4j, Dgraph) since a crawl is essentially a link graph. Our take: a crawler is an analytics pipeline, not a graph explorer. When we need graph algorithms (PageRank, BFS depth), we compute them in-memory in Go and write the results back. A million-page link graph fits in ~200MB of RAM and computes in seconds — no need for a second database.
Tech stack
| Layer | Technology |
|-------|------------|
| Crawler engine | Go, net/http, goroutine pool, HTTP/2 (via utls ALPN negotiation) |
| TLS fingerprinting | refraction-networking/utls (Chrome/Firefox/Edge profiles) |
| HTML parsing | goquery (CSS selectors) |
| URL normalization | purell + custom rules |
| robots.txt | temoto/robotstxt |
| Storage | ClickHouse (via clickhouse-go/v2) |
| API keys / sessions | SQLite (modernc.org/sqlite) |
| Web UI | Svelte 5, Vite (zero runtime dependencies) |
| Desktop app | webview (macOS) |
| CLI | Cobra + Viper |
API
The REST API is available when running crawlobserver serve. All endpoints are under /api/.
Sessions
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | /api/sessions | List all sessions |
| POST | /api/crawl | Start a new crawl |
| POST | /api/sessions/:id/stop | Stop a running crawl |
| POST | /api/sessions/:id/resume | Resume a stopped crawl |
| DELETE | /api/sessions/:id | Delete a session and its data |
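For example, listing sessions and stopping a running crawl with curl. This sketch assumes the default server.port of 8899 and Basic Auth credentials from server.username / server.password (API-key auth is shown at the end of this section):

```bash
# List all sessions
curl -u admin:$PASSWORD http://127.0.0.1:8899/api/sessions

# Stop a running crawl (replace <session-id> with an ID from the list above)
curl -u admin:$PASSWORD -X POST http://127.0.0.1:8899/api/sessions/<session-id>/stop
```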
Pages & Links
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | /api/sessions/:id/pages | Crawled pages (paginated, filterable) |
| GET | /api/sessions/:id/links | External links |
| GET | /api/sessions/:id/internal-links | Internal links |
| GET | /api/sessions/:id/page-detail?url= | Full detail for one URL |
| GET | /api/sessions/:id/page-html?url= | Raw HTML body |
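A sketch of fetching the full detail for one URL, using the url query parameter shown in the table:

```bash
# Full detail for a single crawled URL
curl -u admin:$PASSWORD \
  "http://127.0.0.1:8899/api/sessions/<session-id>/page-detail?url=https://example.com/"
```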
Analytics
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | /api/sessions/:id/stats | Session statistics |
| GET | /api/sessions/:id/events | Live progress (SSE) |
| POST | /api/sessions/:id/compute-pagerank | Compute internal PageRank |
| POST | /api/sessions/:id/recompute-depths | Recompute crawl depths |
| GET | /api/sessions/:id/pagerank-top | Top pages by PageRank |
| GET | /api/sessions/:id/pagerank-distribution | PageRank histogram |
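A sketch of triggering a PageRank computation and following live progress over SSE (curl's -N flag disables buffering so events print as they arrive):

```bash
# Compute internal PageRank for a session
curl -u admin:$PASSWORD -X POST http://127.0.0.1:8899/api/sessions/<session-id>/compute-pagerank

# Follow live crawl progress via Server-Sent Events
curl -N -u admin:$PASSWORD http://127.0.0.1:8899/api/sessions/<session-id>/events
```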
robots.txt & Sitemaps
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | /api/sessions/:id/robots-hosts | Hosts with robots.txt |
| GET | /api/sessions/:id/robots-content | robots.txt content |
| POST | /api/sessions/:id/robots-test | Test URLs against robots.txt |
| GET | /api/sessions/:id/sitemaps | Discovered sitemaps |
Authentication: Basic Auth or API key (X-API-Key header).
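For programmatic access with a project-scoped key created in the web UI, the same calls look like this:

```bash
# API-key authentication instead of Basic Auth
curl -H "X-API-Key: $API_KEY" http://127.0.0.1:8899/api/sessions
```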
Contributing
We welcome contributions. Please read CONTRIBUTING.md before submitting anything.
TL;DR:
- Open an issue before starting significant work
- One PR = one thing (don't mix features and refactors)
- Write tests for new code
- Run make test && make lint before pushing
- Follow existing code style — don't reorganize what you didn't change
License
AGPL-3.0 — see LICENSE.
Built by SEObserver.