crawler

Published: Jul 1, 2014 License: MIT Imports: 17 Imported by: 0

unicrawl is a simple single-threaded crawler that saves its output to a JSON file; the HTML is base64-encoded.

webcrawl is a simple HTTP server that can crawl a URL; the plan is to later merge unicrawl into it.


go build unicrawl.go

./unicrawl -u http://www.rediff.com -m rediff -p output



go build webcrawl
./webcrawl

open http://127.0.0.1:4040/Get/www.railsfactory.com

Done

a) fetch the page
b) extract links
c) save as a JSON file
d) convert relative URLs to absolute URLs
e) limit crawling to a pattern
f) skip non-HTTP link schemes (e.g. mailto)
g) do not crawl already-crawled links
h) recursion
i) read the URL, pattern, and output folder from input (unicrawl)



Todo


a) make it concurrent: run each request in a goroutine

b) use a proper HTTP client
* client config: timeout, retry, user agent, cookies, headers, etc.
* speed throttling
* delay between requests
* limit concurrency to a fixed number
* use a proxy if needed
* handle GET and POST
* follow redirects?
* handle URLs with query params

c) load config from a config file
d) log to a file
e) depth-first search / breadth-first search
f) save to a database (optional)
g) respect robots.txt
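For todo item e), the choice between the two traversal orders is just the choice of data structure: a FIFO queue gives breadth-first order, while popping from the end of the same slice (a stack) gives depth-first. A small sketch with hypothetical names; `links` stands in for real link extraction:

```go
package main

import "fmt"

// bfsCrawl visits pages level by level up to maxDepth using a FIFO
// queue, and returns the visit order. Swapping the queue for a stack
// (pop from the end) would turn this into depth-first search.
func bfsCrawl(start string, maxDepth int, links func(string) []string) []string {
	type item struct {
		url   string
		depth int
	}
	seen := map[string]bool{start: true}
	queue := []item{{start, 0}}
	var order []string
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:] // FIFO pop: breadth-first
		order = append(order, cur.url)
		if cur.depth == maxDepth {
			continue // depth limit reached, do not expand further
		}
		for _, l := range links(cur.url) {
			if !seen[l] { // do not enqueue already-seen links
				seen[l] = true
				queue = append(queue, item{l, cur.depth + 1})
			}
		}
	}
	return order
}

func main() {
	g := map[string][]string{"a": {"b", "c"}, "b": {"d"}}
	fmt.Println(bfsCrawl("a", 2, func(u string) []string { return g[u] }))
}
```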

