wbot

Version: v0.2.0
Published: Feb 18, 2024 License: MIT Imports: 11 Imported by: 1

README

WBot

A configurable, thread-safe web crawler that provides a minimal interface for crawling and downloading web pages.

Features

  • Clean minimal API.
  • Configurable: MaxDepth, MaxBodySize, Rate Limit, Parallelism, User Agent & Proxy rotation.
  • Memory-efficient, thread-safe.
  • Provides built-in interfaces: Fetcher, Store, Queue & a Logger.

API

WBot provides a minimal API for crawling web pages.

Run(links ...string) error
OnReponse(fn func(*wbot.Response))
Metrics() map[string]int64
Shutdown()

Usage

package main

import (
	"fmt"
	"log"

	"github.com/rs/zerolog"
	"github.com/twiny/wbot"
	"github.com/twiny/wbot/crawler"
)

func main() {
	bot := crawler.New(
		crawler.WithParallel(50),
		crawler.WithMaxDepth(5),
		crawler.WithRateLimit(&wbot.RateLimit{
			Hostname: "*",
			Rate:     "10/1s",
		}),
		crawler.WithLogLevel(zerolog.DebugLevel),
	)
	defer bot.Shutdown()

	// read responses
	bot.OnReponse(func(resp *wbot.Response) {
		fmt.Printf("crawled: %s\n", resp.URL.String())
	})

	if err := bot.Run(
		"https://crawler-test.com/",
	); err != nil {
		log.Fatal(err)
	}

	log.Printf("finished crawling\n")
}

Bugs

Bugs or suggestions? Please visit the issue tracker.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func FindLinks(body []byte) (hrefs []string)
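
The listing gives only the signature of FindLinks, so the following is a rough sketch of what a function with that shape might do, not the package's implementation. The helper findLinks and its regular expression are hypothetical, and a real crawler would more likely use an HTML tokenizer than a regex.

```go
package main

import (
	"fmt"
	"regexp"
)

// hrefPattern matches href attributes whose value is wrapped in quotes.
// Illustrative only: this is an assumption, not how wbot extracts links.
var hrefPattern = regexp.MustCompile(`(?i)href\s*=\s*["']([^"']+)["']`)

// findLinks is a hypothetical stand-in with the same signature as FindLinks.
// It returns the raw href values in document order, without resolving them.
func findLinks(body []byte) (hrefs []string) {
	for _, m := range hrefPattern.FindAllSubmatch(body, -1) {
		hrefs = append(hrefs, string(m[1]))
	}
	return hrefs
}

func main() {
	body := []byte(`<a href="/about">About</a> <a href='https://example.com/'>Home</a>`)
	for _, h := range findLinks(body) {
		fmt.Println(h)
	}
}
```

The returned values may be relative, which is presumably why Request also exposes ResolveURL.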

func Hostname added in v0.2.0

func Hostname(link string) (string, error)
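
The documentation does not describe Hostname's behavior. Assuming it extracts the host part of a link, a minimal stand-in built on net/url could look like this; the helper name hostname and the error for host-less links are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// hostname is a hypothetical stand-in with the same signature as Hostname.
// It returns the host without the port, or an error when the link has none.
func hostname(link string) (string, error) {
	u, err := url.Parse(link)
	if err != nil {
		return "", err
	}
	if u.Hostname() == "" {
		return "", errors.New("link has no hostname")
	}
	return u.Hostname(), nil
}

func main() {
	h, err := hostname("https://crawler-test.com/links/page")
	fmt.Println(h, err)
}
```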

Types

type Fetcher

type Fetcher interface {
	Fetch(ctx context.Context, req *Request) (*Response, error)
	Close() error
}

type FilterRule added in v0.1.6

type FilterRule struct {
	Hostname string
	Allow    []*regexp.Regexp
	Disallow []*regexp.Regexp
}

type MetricsMonitor added in v0.1.6

type MetricsMonitor interface {
	IncTotalRequests()
	IncSuccessfulRequests()
	IncFailedRequests()

	IncTotalLink()
	IncCrawledLink()
	IncSkippedLink()
	IncDuplicatedLink()

	Metrics() map[string]int64
}
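
A custom monitor only needs to satisfy the interface above. As a sketch, the following uses atomic counters so it is safe to call from concurrent workers. The type name counters and the map keys returned by Metrics are assumptions; the keys used by the built-in monitor are not documented here.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// MetricsMonitor is copied from the documentation above so the example compiles on its own.
type MetricsMonitor interface {
	IncTotalRequests()
	IncSuccessfulRequests()
	IncFailedRequests()
	IncTotalLink()
	IncCrawledLink()
	IncSkippedLink()
	IncDuplicatedLink()
	Metrics() map[string]int64
}

// counters is a hypothetical monitor backed by atomically updated int64 fields.
type counters struct {
	totalRequests, successfulRequests, failedRequests   int64
	totalLink, crawledLink, skippedLink, duplicatedLink int64
}

// Compile-time check that counters implements the interface.
var _ MetricsMonitor = (*counters)(nil)

func (c *counters) IncTotalRequests()      { atomic.AddInt64(&c.totalRequests, 1) }
func (c *counters) IncSuccessfulRequests() { atomic.AddInt64(&c.successfulRequests, 1) }
func (c *counters) IncFailedRequests()     { atomic.AddInt64(&c.failedRequests, 1) }
func (c *counters) IncTotalLink()          { atomic.AddInt64(&c.totalLink, 1) }
func (c *counters) IncCrawledLink()        { atomic.AddInt64(&c.crawledLink, 1) }
func (c *counters) IncSkippedLink()        { atomic.AddInt64(&c.skippedLink, 1) }
func (c *counters) IncDuplicatedLink()     { atomic.AddInt64(&c.duplicatedLink, 1) }

// Metrics returns a snapshot of all counters. The key names are illustrative.
func (c *counters) Metrics() map[string]int64 {
	return map[string]int64{
		"total_requests":      atomic.LoadInt64(&c.totalRequests),
		"successful_requests": atomic.LoadInt64(&c.successfulRequests),
		"failed_requests":     atomic.LoadInt64(&c.failedRequests),
		"total_link":          atomic.LoadInt64(&c.totalLink),
		"crawled_link":        atomic.LoadInt64(&c.crawledLink),
		"skipped_link":        atomic.LoadInt64(&c.skippedLink),
		"duplicated_link":     atomic.LoadInt64(&c.duplicatedLink),
	}
}

func main() {
	c := &counters{}
	c.IncTotalRequests()
	c.IncSuccessfulRequests()
	c.IncTotalLink()
	fmt.Println(c.Metrics())
}
```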

type Param added in v0.1.4

type Param struct {
	Proxy       string
	UserAgent   string
	Referer     string
	MaxBodySize int64
	Timeout     time.Duration
}

type ParsedURL added in v0.1.6

type ParsedURL struct {
	Hash string
	Root string
	URL  *url.URL
}

func NewURL added in v0.1.6

func NewURL(raw string) (*ParsedURL, error)

func (*ParsedURL) String added in v0.2.0

func (u *ParsedURL) String() string

type Queue

type Queue interface {
	Push(ctx context.Context, req *Request) error
	Pop(ctx context.Context) (*Request, error)
	Len() int32
	Close() error
}

type RateLimit added in v0.1.6

type RateLimit struct {
	Hostname string
	Rate     string
}

type Request

type Request struct {
	Target *ParsedURL
	Param  *Param
	Depth  int32
}

func (*Request) ResolveURL added in v0.1.6

func (r *Request) ResolveURL(u string) (*url.URL, error)

type Response

type Response struct {
	URL         *ParsedURL
	Status      int
	Body        []byte
	NextURLs    []*ParsedURL
	Depth       int32
	ElapsedTime time.Duration
	Err         error
}

type Store

type Store interface {
	HasVisited(ctx context.Context, u *ParsedURL) (bool, error)
	Close() error
}

Directories

Path Synopsis
plugin
