CrawlObserver
Free, open-source SEO crawler built by SEObserver.
Extract 45+ SEO signals per page. Query millions of pages in milliseconds.
Quick Start · Web UI · CLI · Config · API · Contributing
Why CrawlObserver?
At SEObserver, we crawl billions of pages. We built CrawlObserver because every SEO deserves a proper crawler — one that stores data in a columnar database and lets you query millions of pages in milliseconds, even while the crawl is ongoing.
We're giving it to the community for free. Use it, break it, improve it.
What it does
- Crawls websites following internal links from seed URLs
- Extracts 45+ SEO signals per page (title, canonical, meta tags, headings, hreflang, Open Graph, schema.org, images, links, indexability...)
- Respects robots.txt and per-host crawl delays
- Tracks redirect chains, response times, and body sizes
- Stores everything in a columnar database for instant analytical queries
- Computes PageRank and crawl depth per session
- Comes with a web UI, a REST API, and a native desktop app
Quick Start
```sh
curl -fsSL crawlobserver.com/install.sh | sh
./crawlobserver
```
That's it. Open http://127.0.0.1:8899 — the setup wizard guides you through the rest. CrawlObserver downloads and manages its own database on first run.
macOS desktop app: download the DMG from the latest release.
Windows
ClickHouse does not provide a native Windows binary, so CrawlObserver needs Docker to run the database:
- Install Docker Desktop (free)
- Download crawlobserver-windows-amd64.exe from the latest release
- Open a terminal in the download folder and run:
```sh
docker compose up -d
.\crawlobserver-windows-amd64.exe serve
```
Build from source
Requires Go 1.25+:
```sh
git clone https://github.com/SEObserver/crawlobserver.git
cd crawlobserver
make build
./crawlobserver
```
Advanced: You can also point CrawlObserver at an existing database instance (Docker, remote server...). See the Configuration section for clickhouse.* settings.
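For example, an external instance can be configured with the clickhouse.* keys from the table below (the host and port values here are placeholders for your own setup):

```yaml
clickhouse:
  mode: external            # skip the managed, auto-downloaded binary
  host: db.example.internal # your ClickHouse server
  port: 9000                # its native protocol port
```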
Web UI
Start the web interface with ./crawlobserver serve and open http://127.0.0.1:8899.
The UI gives you:
- Session management — start, stop, resume, delete crawl sessions
- Page explorer — filter and browse crawled pages by status code, title, depth, word count...
- Tabs — overview, titles, meta, headings, images, indexability, response codes, internal links, external links
- PageRank — distribution histogram, treemap by path, top-N pages
- robots.txt tester — view robots.txt per host and test URL access
- Sitemap viewer — discover and browse sitemap trees
- Real-time progress — live crawl stats via Server-Sent Events
- Theming — custom accent color, logo, dark mode
- API key management — project-scoped keys for programmatic access
The UI is a single Go binary — no Node.js runtime needed in production.
CLI Reference
```sh
crawlobserver [command]
```
| Command | Description |
| --- | --- |
| crawl | Start a crawl session |
| serve | Start the web server and browser UI |
| migrate | Create or update database tables |
| sessions | List all crawl sessions |
| report external-links | Export external links (table or CSV) |
| update | Check for updates and self-update |
| install-clickhouse | Download database binary for offline use |
| version | Print version |
Crawl examples
```sh
# Single seed URL
crawlobserver crawl --seed https://example.com

# Multiple seeds from a file (one URL per line)
crawlobserver crawl --seeds-file urls.txt

# Fine-tune the crawl
crawlobserver crawl --seed https://example.com \
  --workers 20 \
  --delay 500ms \
  --max-pages 50000 \
  --max-depth 10 \
  --store-html
```
Reports
```sh
# External links as a table
crawlobserver report external-links --format table

# Export to CSV
crawlobserver report external-links --format csv > external-links.csv

# Filter by session
crawlobserver report external-links --session <session-id> --format csv
```
Configuration
Copy config.example.yaml to config.yaml:
```sh
cp config.example.yaml config.yaml
```
All settings can be overridden via environment variables with the CRAWLOBSERVER_ prefix (e.g. CRAWLOBSERVER_CRAWLER_WORKERS=20) or via CLI flags.
Key settings
| Setting | Default | Description |
| --- | --- | --- |
| crawler.workers | 10 | Concurrent fetch workers |
| crawler.delay | 1s | Per-host request delay |
| crawler.max_pages | 0 | Max pages to crawl (0 = unlimited) |
| crawler.max_depth | 0 | Max crawl depth (0 = unlimited) |
| crawler.timeout | 30s | HTTP request timeout |
| crawler.user_agent | CrawlObserver/1.0 | User-Agent string |
| crawler.respect_robots | true | Obey robots.txt |
| crawler.store_html | false | Store raw HTML (ZSTD compressed) |
| crawler.crawl_scope | host | host, domain (eTLD+1), or subdirectory |
| clickhouse.host | localhost | Database host |
| clickhouse.port | 19000 | Database native protocol port |
| clickhouse.mode | (auto) | managed, external, or auto-detect |
| server.port | 8899 | Web UI port |
| server.username | admin | Basic auth username |
| server.password | (generated) | Basic auth password (random if not set) |
| resources.max_memory_mb | 0 | Memory soft limit (0 = auto) |
| resources.max_cpu | 0 | CPU limit / GOMAXPROCS (0 = all) |
See config.example.yaml for the full reference.
Architecture
```
Seed URLs
    |
    v
Frontier (priority queue, per-host delay, dedup)
    |
    v
Fetch Workers (N goroutines, robots.txt cache, redirect tracking)
    |
    v
Parser (goquery: 45+ SEO signals extracted)
    |
    v
Storage Buffer (batch insert, configurable flush)
    |
    v
Columnar DB (partitioned by crawl session, managed automatically)
    |
    |---> Web UI (Svelte 5, embedded in binary)
    |---> REST API (40+ endpoints)
    |---> CLI reports
```
Why a columnar database?
A crawl is a link graph, so why not a graph database? Because a crawler is an analytics pipeline, not a graph explorer. The questions you ask are analytical — "show me all pages with a missing H1 and a 301 canonical", "give me PageRank percentiles by subdirectory" — and columnar databases answer these instantly, even over millions of rows.
When we need graph algorithms (PageRank, crawl depth), we compute them in-memory in Go and write the results back. A million-page link graph fits in ~200MB of RAM and computes in seconds — no need for a graph database.
Under the hood, CrawlObserver uses ClickHouse in managed mode: it downloads a static binary and runs it as a subprocess. You see one program; you get concurrent read/write access, columnar compression (~10:1), and instant session deletion.
How internal PageRank works
CrawlObserver computes PageRank in-memory using the iterative power method (damping factor 0.85, up to 20 iterations, 1e-6 convergence threshold). The result is normalized to a 0–100 scale.
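As a minimal sketch, the power method with these parameters looks like the following (the function name and the exact data layout are illustrative, not CrawlObserver's actual code):

```go
package main

import (
	"fmt"
	"math"
)

// pageRank runs the iterative power method over an adjacency list.
// links[i] lists the page indices that page i links to.
func pageRank(n int, links [][]int) []float64 {
	const (
		damping   = 0.85
		maxIter   = 20
		threshold = 1e-6
	)
	pr := make([]float64, n)
	for i := range pr {
		pr[i] = 1.0 / float64(n)
	}
	for iter := 0; iter < maxIter; iter++ {
		next := make([]float64, n)
		base := (1 - damping) / float64(n)
		for i := range next {
			next[i] = base
		}
		// Pages with zero outlinks (true dead ends) have their rank
		// redistributed evenly; linking pages pass equal shares.
		var dangling float64
		for i, out := range links {
			if len(out) == 0 {
				dangling += pr[i]
				continue
			}
			share := damping * pr[i] / float64(len(out))
			for _, j := range out {
				next[j] += share
			}
		}
		for i := range next {
			next[i] += damping * dangling / float64(n)
		}
		// Stop once the L1 delta drops below the threshold.
		var delta float64
		for i := range pr {
			delta += math.Abs(next[i] - pr[i])
		}
		pr = next
		if delta < threshold {
			break
		}
	}
	// Normalize so the top page scores 100.
	max := 0.0
	for _, v := range pr {
		if v > max {
			max = v
		}
	}
	for i := range pr {
		pr[i] = 100 * pr[i] / max
	}
	return pr
}

func main() {
	// 0 -> {1,2}, 1 -> {0}, 2 -> {0}: page 0 collects the most rank.
	ranks := pageRank(3, [][]int{{1, 2}, {0}, {0}})
	fmt.Printf("%.1f %.1f %.1f\n", ranks[0], ranks[1], ranks[2])
}
```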
Key modeling choices:
- External links dilute PR. When a page has outgoing links to external sites, those links are counted in the total outlink divisor. A page with 3 internal links and 7 external links passes PR/10 to each internal target — not PR/3. This correctly models the fact that link equity is split across all outgoing links, not just internal ones.
- Nofollow / sponsored / UGC links dilute but do not pass PR. Links with rel="nofollow", rel="sponsored", or rel="ugc" are counted in the total outlink divisor (they consume link equity) but are excluded from the edge graph (they don't transfer it). This matches the "evaporating" model: nofollow links burn PageRank without redirecting it.
- External-only pages are not dangling. A page that links only to external sites is not treated as a dangling node. Its rank leaks out of the internal graph instead of being redistributed. Only pages with zero outgoing links (true dead ends) trigger dangling-node redistribution.
- Self-links are excluded. A page linking to itself does not count as an outgoing link for PageRank purposes.
These choices mean that CrawlObserver's internal PageRank is conservative: pages that link heavily to external sites or use nofollow on internal links will show lower PR flow than a naive internal-only model would suggest. We believe this better reflects how search engines handle link equity.
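The divisor rules above can be sketched as follows (the types and function names are illustrative, not the actual CrawlObserver API):

```go
package main

import "fmt"

// Link is a simplified outgoing link as seen by the PageRank builder.
type Link struct {
	Internal bool // points to a page inside the crawl scope
	Nofollow bool // rel="nofollow", "sponsored", or "ugc"
	SelfLink bool // links back to the same URL
}

// perLinkShare returns the fraction of a page's PageRank passed to EACH
// followed internal target. The divisor counts every outgoing link except
// self-links; nofollow and external links dilute but receive nothing.
func perLinkShare(links []Link) float64 {
	divisor := 0
	followedInternal := 0
	for _, l := range links {
		if l.SelfLink {
			continue // self-links are excluded entirely
		}
		divisor++ // external and nofollow links still consume equity
		if l.Internal && !l.Nofollow {
			followedInternal++
		}
	}
	if divisor == 0 || followedInternal == 0 {
		return 0 // dead end, or rank leaks out of the internal graph
	}
	return 1.0 / float64(divisor)
}

func main() {
	// 3 internal followed links + 7 external links:
	// each internal target receives PR/10, not PR/3.
	links := make([]Link, 0, 10)
	for i := 0; i < 3; i++ {
		links = append(links, Link{Internal: true})
	}
	for i := 0; i < 7; i++ {
		links = append(links, Link{Internal: false})
	}
	fmt.Println(perLinkShare(links)) // 0.1
}
```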
Tech stack
| Layer | Technology |
| --- | --- |
| Crawler engine | Go, net/http, goroutine pool, HTTP/2 (via utls ALPN negotiation) |
| TLS fingerprinting | refraction-networking/utls (Chrome/Firefox/Edge profiles) |
| HTML parsing | goquery (CSS selectors) |
| URL normalization | purell + custom rules |
| robots.txt | temoto/robotstxt |
| Storage | ClickHouse (via clickhouse-go/v2) |
| API keys / sessions | SQLite (modernc.org/sqlite) |
| Web UI | Svelte 5, Vite (zero runtime dependencies) |
| Desktop app | webview (macOS) |
| CLI | Cobra + Viper |
API
The REST API is available when running crawlobserver serve. All endpoints are under /api/.
Sessions
| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/sessions | List all sessions |
| POST | /api/crawl | Start a new crawl |
| POST | /api/sessions/:id/stop | Stop a running crawl |
| POST | /api/sessions/:id/resume | Resume a stopped crawl |
| DELETE | /api/sessions/:id | Delete a session and its data |
Pages & Links
| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/sessions/:id/pages | Crawled pages (paginated, filterable) |
| GET | /api/sessions/:id/links | External links |
| GET | /api/sessions/:id/internal-links | Internal links |
| GET | /api/sessions/:id/page-detail?url= | Full detail for one URL |
| GET | /api/sessions/:id/page-html?url= | Raw HTML body |
Analytics
| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/sessions/:id/stats | Session statistics |
| GET | /api/sessions/:id/events | Live progress (SSE) |
| POST | /api/sessions/:id/compute-pagerank | Compute internal PageRank |
| POST | /api/sessions/:id/recompute-depths | Recompute crawl depths |
| GET | /api/sessions/:id/pagerank-top | Top pages by PageRank |
| GET | /api/sessions/:id/pagerank-distribution | PageRank histogram |
robots.txt & Sitemaps
| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/sessions/:id/robots-hosts | Hosts with robots.txt |
| GET | /api/sessions/:id/robots-content | robots.txt content |
| POST | /api/sessions/:id/robots-test | Test URLs against robots.txt |
| GET | /api/sessions/:id/sitemaps | Discovered sitemaps |
Authentication: Basic Auth or API key (X-API-Key header).
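As a sketch, an API-key request to the sessions endpoint can be built like this in Go (the base URL and key are placeholders; only the X-API-Key header name comes from the line above):

```go
package main

import (
	"fmt"
	"net/http"
)

// newSessionsRequest builds an authenticated request for a running
// CrawlObserver instance; it does not assume a live server.
func newSessionsRequest(baseURL, apiKey string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, baseURL+"/api/sessions", nil)
	if err != nil {
		return nil, err
	}
	// Alternatively: req.SetBasicAuth(username, password)
	req.Header.Set("X-API-Key", apiKey)
	return req, nil
}

func main() {
	req, err := newSessionsRequest("http://127.0.0.1:8899", "your-key-here")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path, "key set:", req.Header.Get("X-API-Key") != "")
}
```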
Contributing
We welcome contributions. Please read CONTRIBUTING.md before submitting anything.
TL;DR:
- Open an issue before starting significant work
- One PR = one thing (don't mix features and refactors)
- Write tests for new code
- Run make test && make lint before pushing
- Follow existing code style — don't reorganize what you didn't change
Acknowledgments
Thanks to the people who helped shape CrawlObserver with their feedback, testing, and ideas.
License
AGPL-3.0 — see LICENSE.
Built by SEObserver.