crawler

This is a web crawler I wrote that allows for large-scale concurrency while staying stack safe and polite.

Requirements

  • The Go toolchain.
  • A MongoDB Atlas cluster: crawl results are bulk inserted into MongoDB and the collection is indexed for Atlas Search.

Installation

Run the following to add the package to your project:

go get github.com/junwei890/crawler@latest

The package exposes several functions, but in most cases you will only need:

func StartCrawl(dbURI string, links []string) error

Usage

To use the function above, pass it your MongoDB URI and a slice of links you would like to crawl. Ensure that every link includes its protocol (e.g. https://).
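A minimal sketch of a caller, assuming the package is imported under the name crawler and that the URI lives in an environment variable (the variable name and the seed links are illustrative):

package main

import (
    "log"
    "os"

    "github.com/junwei890/crawler"
)

func main() {
    // Seed links must include their protocol.
    links := []string{
        "https://go.dev",
        "https://blog.golang.org",
    }

    // MONGODB_URI is an assumed variable name; any valid MongoDB URI string works.
    if err := crawler.StartCrawl(os.Getenv("MONGODB_URI"), links); err != nil {
        log.Fatal(err)
    }
}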

Inner workings

Program entry

Once the MongoDB URI and sites have been passed in, a database connection is established and a goroutine is spawned to crawl each site, up to a thousand at a time.
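In sketch form, that fan-out looks something like the following, using a buffered channel as a semaphore to cap concurrency. The cap, the semaphore pattern and the helper name are illustrative, not necessarily how StartCrawl does it internally:

import "sync"

// crawlAll is an illustrative sketch, not the crawler's actual code.
func crawlAll(sites []string) {
    sem := make(chan struct{}, 1000) // at most a thousand crawls in flight
    var wg sync.WaitGroup

    for _, site := range sites {
        wg.Add(1)
        sem <- struct{}{} // take a slot before spawning
        go func(site string) {
            defer wg.Done()
            defer func() { <-sem }() // free the slot when done
            crawlSite(site) // hypothetical per-site crawl function
        }(site)
    }
    wg.Wait()
}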

Robots.txt

For each site, a GET request is made for its robots.txt file. This file outlines which routes a crawler can and cannot access, as well as the crawl delay it should abide by.

Based on the response, one of several things could happen:

  • 403: The site doesn't want us crawling, so we won't.
  • 404: There's no robots.txt file so we will be crawling the site.
  • Malformed or no Content-Type headers: The site won't be crawled.

If all these checks pass, the file is run through a parser that extracts its rules.
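The response handling corresponds roughly to the sketch below; the crawler's real checks live inside GetRobots and ParseRobots and may differ in detail:

import (
    "io"
    "net/http"
)

// canCrawl is an illustrative sketch of the robots.txt handling described above.
func canCrawl(site string) (allowed bool, robots []byte, err error) {
    resp, err := http.Get(site + "/robots.txt")
    if err != nil {
        return false, nil, err
    }
    defer resp.Body.Close()

    switch {
    case resp.StatusCode == http.StatusForbidden:
        // 403: the site doesn't want us crawling, so we won't.
        return false, nil, nil
    case resp.StatusCode == http.StatusNotFound:
        // 404: no robots.txt, so the whole site can be crawled.
        return true, nil, nil
    case resp.Header.Get("Content-Type") == "":
        // Missing or malformed Content-Type: the site won't be crawled.
        return false, nil, nil
    }

    robots, err = io.ReadAll(resp.Body)
    if err != nil {
        return false, nil, err
    }
    return true, robots, nil // these bytes are handed to the rules parser next
}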

Breadth First Traversal

A breadth first traversal was chosen over a recursive depth first one. Go doesn't perform tail call optimization: each recursive call adds a new stack frame instead of reusing the current one, so a depth first traversal could potentially crash the program on massive sites.

This decision impaired performance, since we couldn't crawl each route in a separate goroutine; however, it gave us much better stack safety, since memory grows only with the size of the queue rather than the call stack.
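In sketch form, the traversal is just a slice used as a FIFO queue, so the frontier lives on the heap instead of the call stack (illustrative only, with a hypothetical fetch-and-parse helper):

// crawlSite is a sketch of the queue-based traversal described above,
// not the crawler's actual code.
func crawlSite(start string) {
    queue := []string{start}
    visited := map[string]struct{}{}

    for len(queue) > 0 {
        current := queue[0]
        queue = queue[1:] // dequeue instead of recursing

        if _, seen := visited[current]; seen {
            continue
        }
        visited[current] = struct{}{}

        links := fetchAndParse(current) // hypothetical helper
        queue = append(queue, links...) // enqueue outgoing links instead of going deeper
    }
}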

Early returns

The crawling takes place in a for loop that runs until the queue is empty. Before getting and parsing HTML, several checks are done:

  • Checks if the queue is empty.
  • Checks if we are still within the same hostname.
  • Checks if we have already visited this route, or if we are even allowed to visit it.

If the queue is empty the loop exits; otherwise, a route that is off the hostname, already visited, or disallowed is skipped, and we head to the next iteration (sketched below).
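Using the exposed helpers, the top of each iteration might look like this. It is a fragment, not StartCrawl's actual code: queue, domain, visited and rules are assumed to be set up by the surrounding loop.

if len(queue) == 0 {
    break // nothing left to crawl for this site
}
current := queue[0]
queue = queue[1:]

// Are we still within the seed site's hostname?
sameHost, err := crawler.CheckDomain(domain, current)
if err != nil || !sameHost {
    continue
}

// Have we already visited this route, and are we even allowed to?
norm, err := crawler.Normalize(current)
if err != nil || !crawler.CheckAbility(visited, rules, norm) {
    continue
}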

HTML

Once a route makes it through the early returns, a GET request is made for the route's HTML. If the route responds with a status code in the 400 to 499 range, or if the Content-Type in the response header is not text/html, we skip to the next iteration.

The retrieved HTML is then passed through a parser that extracts the title, content and outgoing links. The title and content are stored in a struct and held in a slice for the duration of the crawl, while the links are enqueued.
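With the exposed functions, that step looks roughly like the fragment below; current, domain, results and queue are assumed to be defined by the surrounding loop.

page, err := crawler.GetHTML(current)
if err != nil {
    continue // covers 4xx responses and non-text/html Content-Types, among other failures
}

parsed, err := crawler.ParseHTML(domain, page)
if err != nil {
    continue
}

results = append(results, parsed)      // Title and Content are held until the crawl ends
queue = append(queue, parsed.Links...) // outgoing links join the queue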

Post-crawling

Once each site exits the for loop, the extracted titles and content are bulk inserted into MongoDB; database and collection creation is automated.

Once all sites have been crawled, the collection is then automatically indexed for Atlas Search.

On subsequent runs, the crawler builds on top of the database, collection and index created during the first successful run. All of this is handled by the crawler itself, with no manual setup required.
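A minimal sketch of the bulk insert using the official MongoDB Go driver (v1); the database and collection names here are assumptions, not the crawler's actual choices:

import (
    "context"

    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"

    "github.com/junwei890/crawler"
)

// storeResults is an illustrative sketch, not the crawler's internals.
func storeResults(ctx context.Context, uri string, results []crawler.Response) error {
    client, err := mongo.Connect(ctx, options.Client().ApplyURI(uri))
    if err != nil {
        return err
    }
    defer client.Disconnect(ctx)

    docs := make([]interface{}, 0, len(results))
    for _, r := range results {
        docs = append(docs, r)
    }

    // Inserting into a database and collection that don't exist yet creates
    // them, which is what makes the automated setup possible.
    _, err = client.Database("crawler").Collection("pages").InsertMany(ctx, docs)
    return err
}

The Atlas Search index creation that follows the inserts is handled by the crawler and is left out of this sketch.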

Exposed functions

func Normalize(rawURL string) (string, error)

Standardises URL structure, turning URLs like "http://www.site.com/" into "www.site.com". See utils_test for edge cases.

func GetHTML(rawURL string) ([]byte, error)

Makes a GET request for the page's HTML and returns it as a slice of bytes. Strict checks are done before returning.

type Response struct {
    Title string
    Content []string
    Links []string
}

func ParseHTML(domain *url.URL, page []byte) (Response, error)

Parses HTML and returns extracted content in a Response struct. domain is prepended to relative routes found in <a> tags, such as "/cats" or "/cats/blogs".

func GetRobots(rawURL string) ([]byte, error)

Makes a GET request for a website's robots.txt file and returns it as a slice of bytes. Strict checks are done before returning.

type Rules struct {
    Allowed []string
    Disallowed []string
    Delay int
}

func ParseRobots(normURL string, textFile []byte) (Rules, error)

Parses the robots.txt file and returns the extracted rules in a Rules struct. The normalized URL is prepended to the routes.
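For example, the robots helpers can be chained like this (a fragment; the URL is illustrative and the error handling assumes an enclosing function that returns an error):

norm, err := crawler.Normalize("https://www.site.com")
if err != nil {
    return err
}

robots, err := crawler.GetRobots("https://www.site.com")
if err != nil {
    return err
}

rules, err := crawler.ParseRobots(norm, robots)
if err != nil {
    return err
}
// rules.Allowed, rules.Disallowed and rules.Delay now drive the crawl.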

func CheckAbility(visited map[string]struct{}, rules Rules, normURL string) bool

Checks if the crawler can and should crawl the given URL.

func CheckDomain(domain *url.URL, rawURL string) (bool, error)

Checks if the crawler is still within the same domain.

Planned extensions

I'm not much of a UI guy, as you can probably tell from the commit history. If you would like to wrap this in a UI, feel free to fork the repo.

These are the extensions I have planned:

  • Site map crawling.
