Documentation
Overview
gowebcrawler is a concurrent web crawler that generates a JSON sitemap for a given root URL.
Index

Constants
This section is empty.
Variables
This section is empty.
Functions

Types

type Page
type Page struct {
	Url      string
	Assets   []string
	Links    []string
	Children map[string]*Page
	// contains filtered or unexported fields
}
A Page represents a web page's relation to other pages and holds the data needed to build a site map showing the assets it depends on.
type PageMessage

type UrlParser
type UrlParser struct{}
UrlParser implements Parser to extract relevant data from a page at a given URL.
type WebCrawler
WebCrawler implements Crawler and generates a JSON site map from a starting domain and path. It takes care not to crawl other domains or fetch the same page more than once, and it supports a FetchLimit to cap the total number of fetches made.