crawler

Version: v0.0.0-...-67d9479
Published: Jul 1, 2014
License: MIT


Repository

github.com/senthilnayagam/go-web-developer


README

unicrawl is a simple single-threaded crawler that saves its output to a JSON file; the HTML of each page is base64-encoded.

webcrawl is a simple HTTP server that can crawl a URL; the plan is to integrate unicrawl into it later.


go build unicrawl.go

./unicrawl -u http://www.rediff.com -m rediff -p output



go build webcrawl
./webcrawl

open http://127.0.0.1:4040/Get/www.railsfactory.com
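
A `/Get/<host>` endpoint like the one above could be handled roughly as follows. This is a sketch under assumptions, not webcrawl's actual code: `targetURL` and `getHandler` are hypothetical names, and prefixing the host with plain `http://` is a guess. The demo in `main` uses `net/http/httptest` so that it runs without network access.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// targetURL turns a request path such as "/Get/www.example.com" into
// "http://www.example.com". Defaulting to plain http is an assumption.
func targetURL(path string) (string, bool) {
	host := strings.TrimPrefix(path, "/Get/")
	if host == path || host == "" {
		return "", false // prefix missing, or nothing after it
	}
	return "http://" + host, true
}

// getHandler fetches the target page and relays its status and body.
func getHandler(client *http.Client) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		target, ok := targetURL(r.URL.Path)
		if !ok {
			http.Error(w, "usage: /Get/<host>[/path]", http.StatusBadRequest)
			return
		}
		resp, err := client.Get(target)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		defer resp.Body.Close()
		w.WriteHeader(resp.StatusCode)
		io.Copy(w, resp.Body)
	}
}

func main() {
	// A local stand-in for the remote site, so the demo needs no network.
	site := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "<html>hello</html>")
	}))
	defer site.Close()

	mux := http.NewServeMux()
	mux.Handle("/Get/", getHandler(&http.Client{Timeout: 10 * time.Second}))

	// A real server would call http.ListenAndServe("127.0.0.1:4040", mux) here.
	req := httptest.NewRequest("GET", "/Get/"+strings.TrimPrefix(site.URL, "http://"), nil)
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, req)
	fmt.Println(rec.Code, rec.Body.String())
}
```

Because the handler fetches whatever host appears in the path, a server like this is best kept bound to 127.0.0.1.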

done
a) fetch page
b) extract links
c) save as JSON file
d) convert relative URLs to absolute URLs
e) limit crawling to URLs matching a pattern
f) skip non-HTTP link schemes (e.g. mailto)
g) do not re-crawl links that were already crawled
h) recursion
i) read URL, pattern and output folder from command-line input (unicrawl)



todo


a) make it concurrent: run each request in its own goroutine

b) use a proper HTTP client
* client configuration: timeout, retry, user agent, cookies, headers, etc.
* speed throttling
* delay between requests
* limit concurrency to a fixed number
* use a proxy if needed
* handle GET and POST
* follow redirects?
* handle URLs with query parameters

c) load configuration from a config file
d) log to a file
e) depth-first / breadth-first search
f) save to a database (optional)
g) honor robots.txt

