wbot

package module
v0.1.6
Published: Aug 27, 2023 License: MIT Imports: 11 Imported by: 1

README

WBot

WBot is a configurable, thread-safe web crawler written in Go. It offers a clean and minimal API for crawling and downloading web pages.

Features:

📦 Clean, Minimal API: Easy-to-use API that gets you up and running quickly.
⚙️ Configurable: MaxDepth, MaxBodySize, Rate Limit, Parallelism, User Agent & Proxy rotation.
🚀 High Performance: Memory-efficient and designed for multi-threaded tasks.
🔌 Extensible: Provides built-in interfaces for Fetcher, Store, Queue, and Logger.

Examples & API

Configurations:

WBot can be configured using the following options:

WithParallel(parallel int) Option
WithMaxDepth(maxDepth int32) Option
WithUserAgents(userAgents []string) Option
WithProxies(proxies []string) Option
WithRateLimit(rates ...*wbot.RateLimit) Option
WithFilter(rules ...*wbot.FilterRule) Option
WithFetcher(fetcher wbot.Fetcher) Option
WithStore(store wbot.Store) Option
WithLogger(logger wbot.Logger) Option
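
The `With...` functions above have the shape of Go's functional options pattern. The sketch below illustrates that pattern only; it assumes `Option` is a function type, and the `config` struct, its defaults, and `newConfig` are hypothetical names that do not come from the package.

```go
package main

import "fmt"

// config is a hypothetical stand-in for the crawler's internal settings.
type config struct {
	parallel   int
	maxDepth   int32
	userAgents []string
}

// Option mutates a config. This is an assumption about the shape of the
// package's Option type, not its actual definition.
type Option func(*config)

func WithParallel(parallel int) Option {
	return func(c *config) { c.parallel = parallel }
}

func WithMaxDepth(maxDepth int32) Option {
	return func(c *config) { c.maxDepth = maxDepth }
}

func WithUserAgents(userAgents []string) Option {
	return func(c *config) { c.userAgents = userAgents }
}

// newConfig starts from assumed defaults and applies the options in order,
// so a later option overrides an earlier one.
func newConfig(opts ...Option) *config {
	c := &config{parallel: 1, maxDepth: 1}
	for _, opt := range opts {
		opt(c)
	}
	return c
}

func main() {
	c := newConfig(WithParallel(10), WithMaxDepth(3))
	fmt.Println(c.parallel, c.maxDepth, len(c.userAgents))
}
```

Callers pass only the settings they care about, and new options can be added later without changing the constructor's signature.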

WBot APIs:

You can interact with WBot using the following methods:

Start(links ...string)
OnReponse(fn func(*wbot.Response))
OnError(fn func(err error))
Stats() map[string]any
Stop()

Quick Start

package main

import (
	"github.com/twiny/wbot"
	"github.com/twiny/wbot/crawler"
)

func main() {
	bot := crawler.New(
		crawler.WithParallel(10),
		crawler.WithMaxDepth(10),
	)

	bot.OnReponse(func(resp *wbot.Response) {
		_ = resp
	})

	bot.OnError(func(err error) {
		_ = err
	})

	bot.Start(
		"https://www.github.com/",
		"https://crawler-test.com/",
		"https://www.warriorforum.com/",
	)
}

TODO

  • Add support for robots.txt.
  • Add test cases.
  • Implement Fetch using Chromedp.
  • Add more examples.
  • Add documentation.

Bugs

Bugs or suggestions? Please visit the issue tracker.

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func FindLinks(body []byte) (hrefs []string)
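
To show what a function with this signature does, here is a standard-library sketch that pulls `href` values out of an HTML body with a regular expression. `findLinks` is a hypothetical stand-in and not the package's implementation; a real HTML parser handles unquoted attributes and malformed markup more reliably.

```go
package main

import (
	"fmt"
	"regexp"
)

// hrefPattern matches href attributes with single- or double-quoted values.
var hrefPattern = regexp.MustCompile(`(?i)href\s*=\s*["']([^"']+)["']`)

// findLinks is a hypothetical stand-in with the same signature as FindLinks.
// It returns the raw href values in document order, without resolving them.
func findLinks(body []byte) (hrefs []string) {
	for _, m := range hrefPattern.FindAllSubmatch(body, -1) {
		hrefs = append(hrefs, string(m[1]))
	}
	return hrefs
}

func main() {
	page := []byte(`<a href="/about">About</a> <a href='https://example.com/'>Home</a>`)
	fmt.Println(findLinks(page))
}
```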

Types

type Fetcher

type Fetcher interface {
	Fetch(ctx context.Context, req *Request) (*Response, error)
	Close() error
}
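
A custom fetcher needs a `Fetch` and a `Close` method. The sketch below backs them with `net/http`, assuming trimmed local `Request` and `Response` types rather than the package's own; `httpFetcher` and `fetchFromTestServer` are hypothetical names. It runs against a local `httptest` server, so no external network is needed.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// Request and Response are trimmed local stand-ins for the package types.
type Request struct {
	URL         string
	UserAgent   string
	MaxBodySize int64
}

type Response struct {
	Status      int
	Body        []byte
	ElapsedTime time.Duration
}

// httpFetcher is a hypothetical fetcher backed by net/http.
type httpFetcher struct {
	client *http.Client
}

func (f *httpFetcher) Fetch(ctx context.Context, req *Request) (*Response, error) {
	start := time.Now()
	hr, err := http.NewRequestWithContext(ctx, http.MethodGet, req.URL, nil)
	if err != nil {
		return nil, err
	}
	if req.UserAgent != "" {
		hr.Header.Set("User-Agent", req.UserAgent)
	}
	res, err := f.client.Do(hr)
	if err != nil {
		return nil, err
	}
	defer res.Body.Close()

	// Read at most MaxBodySize bytes so a huge page cannot exhaust memory.
	body, err := io.ReadAll(io.LimitReader(res.Body, req.MaxBodySize))
	if err != nil {
		return nil, err
	}
	return &Response{Status: res.StatusCode, Body: body, ElapsedTime: time.Since(start)}, nil
}

func (f *httpFetcher) Close() error {
	f.client.CloseIdleConnections()
	return nil
}

// fetchFromTestServer performs one fetch against a local test server.
func fetchFromTestServer(maxBody int64) (*Response, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "hello crawler")
	}))
	defer srv.Close()

	f := &httpFetcher{client: &http.Client{Timeout: 5 * time.Second}}
	defer f.Close()
	return f.Fetch(context.Background(), &Request{URL: srv.URL, MaxBodySize: maxBody})
}

func main() {
	resp, err := fetchFromTestServer(5)
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	fmt.Println(resp.Status, string(resp.Body))
}
```

Passing the context into the HTTP request lets the crawler cancel in-flight fetches when it stops.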

type FilterRule added in v0.1.6

type FilterRule struct {
	Hostname string           `json:"hostname"`
	Allow    []*regexp.Regexp `json:"allow"`
	Disallow []*regexp.Regexp `json:"disallow"`
}
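
The page does not document how `Allow` and `Disallow` interact. As an illustration, the sketch below assumes that a `Disallow` match always rejects, that an empty `Allow` list accepts everything else, and that the patterns are matched against the URL path. `allowed` and `exampleRule` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"regexp"
)

// FilterRule is a local copy of the documented struct (JSON tags omitted).
type FilterRule struct {
	Hostname string
	Allow    []*regexp.Regexp
	Disallow []*regexp.Regexp
}

// allowed applies the assumed semantics: Disallow wins, an empty Allow list
// accepts anything not disallowed, otherwise one Allow pattern must match.
func allowed(rule *FilterRule, path string) bool {
	for _, re := range rule.Disallow {
		if re.MatchString(path) {
			return false
		}
	}
	if len(rule.Allow) == 0 {
		return true
	}
	for _, re := range rule.Allow {
		if re.MatchString(path) {
			return true
		}
	}
	return false
}

// exampleRule allows blog pages on example.com but rejects PDF files.
func exampleRule() *FilterRule {
	return &FilterRule{
		Hostname: "example.com",
		Allow:    []*regexp.Regexp{regexp.MustCompile(`^/blog/`)},
		Disallow: []*regexp.Regexp{regexp.MustCompile(`\.pdf$`)},
	}
}

func main() {
	rule := exampleRule()
	for _, p := range []string{"/blog/post-1", "/blog/file.pdf", "/shop/item"} {
		fmt.Println(p, allowed(rule, p))
	}
}
```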

type Log added in v0.1.6

type Log struct {
	RequestURL   string        `json:"request_url"`
	Status       int           `json:"status"`
	Depth        int32         `json:"depth"`
	Err          error         `json:"err"`
	Timestamp    time.Time     `json:"timestamp"`
	ResponseTime time.Duration `json:"response_time"`
	ContentSize  int64         `json:"content_size"`
	UserAgent    string        `json:"user_agent"`
	RedirectURL  string        `json:"redirect_url"`
}

type Logger

type Logger interface {
	Write(ctx context.Context, log *Log) error
	Close() error
}
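
A logger only has to accept a `*Log` and release its resources on `Close`. The sketch below writes one JSON object per line to any `io.Writer`; it uses a trimmed local `Log` type, and `jsonLogger` and `logOnce` are hypothetical names. The `Err` field is left out because most `error` values do not encode usefully as JSON.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io"
	"sync"
	"time"
)

// Log is a trimmed local copy of the documented struct.
type Log struct {
	RequestURL   string        `json:"request_url"`
	Status       int           `json:"status"`
	Depth        int32         `json:"depth"`
	ResponseTime time.Duration `json:"response_time"`
}

// jsonLogger is a hypothetical logger that emits JSON lines.
type jsonLogger struct {
	mu  sync.Mutex
	enc *json.Encoder
}

func newJSONLogger(w io.Writer) *jsonLogger {
	return &jsonLogger{enc: json.NewEncoder(w)}
}

// Write serializes entries under a mutex so concurrent workers
// cannot interleave their output.
func (l *jsonLogger) Write(ctx context.Context, log *Log) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.enc.Encode(log)
}

// Close is a no-op here because the logger does not own the writer.
func (l *jsonLogger) Close() error { return nil }

// logOnce writes a single entry to a buffer and returns the encoded line.
func logOnce() string {
	var buf bytes.Buffer
	l := newJSONLogger(&buf)
	defer l.Close()
	_ = l.Write(context.Background(), &Log{RequestURL: "https://example.com/", Status: 200, Depth: 1})
	return buf.String()
}

func main() {
	fmt.Print(logOnce())
}
```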

type MetricsMonitor added in v0.1.6

type MetricsMonitor interface {
	IncTotalRequests()
	IncSuccessfulRequests()
	IncFailedRequests()
	IncRetries()
	IncRedirects()

	IncTotalPages()
	IncCrawledPages()
	IncSkippedPages()
	IncParsedLinks()

	IncClientErrors()
	IncServerErrors()
}
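
Because the crawler runs workers in parallel, an implementation of these methods must be safe for concurrent use. One possible approach, sketched below with a hypothetical `counters` type, is to back each method with an atomic counter from `sync/atomic` (requires Go 1.19 or later for `atomic.Int64`).

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// counters is a hypothetical type providing every method listed in the
// MetricsMonitor interface, one atomic counter per metric.
type counters struct {
	totalRequests      atomic.Int64
	successfulRequests atomic.Int64
	failedRequests     atomic.Int64
	retries            atomic.Int64
	redirects          atomic.Int64
	totalPages         atomic.Int64
	crawledPages       atomic.Int64
	skippedPages       atomic.Int64
	parsedLinks        atomic.Int64
	clientErrors       atomic.Int64
	serverErrors       atomic.Int64
}

func (c *counters) IncTotalRequests()      { c.totalRequests.Add(1) }
func (c *counters) IncSuccessfulRequests() { c.successfulRequests.Add(1) }
func (c *counters) IncFailedRequests()     { c.failedRequests.Add(1) }
func (c *counters) IncRetries()            { c.retries.Add(1) }
func (c *counters) IncRedirects()          { c.redirects.Add(1) }
func (c *counters) IncTotalPages()         { c.totalPages.Add(1) }
func (c *counters) IncCrawledPages()       { c.crawledPages.Add(1) }
func (c *counters) IncSkippedPages()       { c.skippedPages.Add(1) }
func (c *counters) IncParsedLinks()        { c.parsedLinks.Add(1) }
func (c *counters) IncClientErrors()       { c.clientErrors.Add(1) }
func (c *counters) IncServerErrors()       { c.serverErrors.Add(1) }

// countConcurrently increments one counter from n goroutines and
// returns the final value, which is n if no updates were lost.
func countConcurrently(n int) int64 {
	var c counters
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.IncTotalRequests()
		}()
	}
	wg.Wait()
	return c.totalRequests.Load()
}

func main() {
	fmt.Println(countConcurrently(100))
}
```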

type Param added in v0.1.4

type Param struct {
	Proxy       string `json:"proxy"`
	UserAgent   string `json:"user_agent"`
	Referer     string `json:"referer"`
	MaxBodySize int64  `json:"max_body_size"`
}

type ParsedURL added in v0.1.6

type ParsedURL struct {
	Hash string   `json:"hash"`
	Root string   `json:"root"`
	URL  *url.URL `json:"url"`
}

func NewURL added in v0.1.6

func NewURL(raw string) (*ParsedURL, error)

type Queue

type Queue interface {
	Push(ctx context.Context, req *Request) error
	Pop(ctx context.Context) (*Request, error)
	Len() int32
	Cancel()
	IsDone() bool
	Close() error
}
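
The following is a minimal in-memory sketch of a queue with these six methods, built on a buffered channel. It uses a trimmed local `Request` type, and `chanQueue`, `errQueueClosed`, and `queueDemo` are hypothetical names. The semantics (a blocking `Pop`, `Cancel` rejecting further work) are assumptions, not documented behavior.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// Request is a trimmed local stand-in for the package type.
type Request struct {
	URL   string
	Depth int32
}

var errQueueClosed = errors.New("queue closed")

// chanQueue is a hypothetical FIFO queue backed by a buffered channel.
type chanQueue struct {
	ch   chan *Request
	done chan struct{}
	once sync.Once
}

func newChanQueue(size int) *chanQueue {
	return &chanQueue{ch: make(chan *Request, size), done: make(chan struct{})}
}

func (q *chanQueue) Push(ctx context.Context, req *Request) error {
	if q.IsDone() {
		return errQueueClosed
	}
	select {
	case q.ch <- req:
		return nil
	case <-q.done:
		return errQueueClosed
	case <-ctx.Done():
		return ctx.Err()
	}
}

// Pop blocks until a request is available, the queue is cancelled,
// or the context ends.
func (q *chanQueue) Pop(ctx context.Context) (*Request, error) {
	select {
	case req := <-q.ch:
		return req, nil
	case <-q.done:
		return nil, errQueueClosed
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

func (q *chanQueue) Len() int32 { return int32(len(q.ch)) }

// Cancel is idempotent: sync.Once guards the channel close.
func (q *chanQueue) Cancel() { q.once.Do(func() { close(q.done) }) }

func (q *chanQueue) IsDone() bool {
	select {
	case <-q.done:
		return true
	default:
		return false
	}
}

func (q *chanQueue) Close() error {
	q.Cancel()
	return nil
}

// queueDemo pushes two requests, pops one, then cancels the queue.
func queueDemo() (first string, remaining int32, pushAfterCancel error) {
	ctx := context.Background()
	q := newChanQueue(4)
	_ = q.Push(ctx, &Request{URL: "https://example.com/a"})
	_ = q.Push(ctx, &Request{URL: "https://example.com/b", Depth: 1})
	req, _ := q.Pop(ctx)
	remaining = q.Len()
	q.Cancel()
	return req.URL, remaining, q.Push(ctx, &Request{URL: "https://example.com/c"})
}

func main() {
	fmt.Println(queueDemo())
}
```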

type RateLimit added in v0.1.6

type RateLimit struct {
	Hostname string `json:"hostname"`
	Rate     string `json:"rate"`
}

type Request

type Request struct {
	Target *ParsedURL `json:"target"`
	Param  *Param     `json:"param"`
	Depth  int32      `json:"depth"`
}

func (*Request) ResolveURL added in v0.1.6

func (r *Request) ResolveURL(u string) (*url.URL, error)
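
Links found in a page are often relative and must be resolved against the URL of the page that contained them. The sketch below shows that step with the standard `net/url` package. `resolve` and `resolveString` are hypothetical helpers, and this is not necessarily how `ResolveURL` is implemented.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolve interprets href relative to base, following the RFC 3986
// rules implemented by url.URL.ResolveReference.
func resolve(base *url.URL, href string) (*url.URL, error) {
	ref, err := url.Parse(href)
	if err != nil {
		return nil, err
	}
	return base.ResolveReference(ref), nil
}

// resolveString is a convenience wrapper that works on plain strings.
func resolveString(base, href string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u, err := resolve(b, href)
	if err != nil {
		return "", err
	}
	return u.String(), nil
}

func main() {
	base := "https://example.com/a/b"
	for _, href := range []string{"/about", "../c", "?page=2", "https://other.example/x"} {
		s, err := resolveString(base, href)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(href, "->", s)
	}
}
```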

type Response

type Response struct {
	URL         *ParsedURL    `json:"url"`
	Status      int           `json:"status"`
	Body        []byte        `json:"-"`
	NextURLs    []*ParsedURL  `json:"next_urls"`
	Depth       int32         `json:"depth"`
	ElapsedTime time.Duration `json:"elapsed_time"`
	Err         error         `json:"-"`
}

type Store

type Store interface {
	HasVisited(ctx context.Context, u *ParsedURL) (bool, error)
	Close() error
}
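
A store lets the crawler skip URLs it has already seen. The in-memory sketch below assumes that `HasVisited` both checks a URL and records it, and that `ParsedURL.Hash` uniquely identifies a URL; neither assumption is confirmed on this page. `memStore` and `visitTwice` are hypothetical names, and `ParsedURL` is a trimmed local copy.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// ParsedURL is a trimmed local copy of the documented struct.
type ParsedURL struct {
	Hash string
}

// memStore is a hypothetical in-memory store guarded by a mutex.
type memStore struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newMemStore() *memStore {
	return &memStore{seen: make(map[string]struct{})}
}

// HasVisited reports whether u was seen before. As an assumption of this
// sketch, it also marks u as visited, so check and insert are one atomic step.
func (s *memStore) HasVisited(ctx context.Context, u *ParsedURL) (bool, error) {
	if err := ctx.Err(); err != nil {
		return false, err
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.seen[u.Hash]; ok {
		return true, nil
	}
	s.seen[u.Hash] = struct{}{}
	return false, nil
}

func (s *memStore) Close() error { return nil }

// visitTwice checks the same URL twice and returns both answers.
func visitTwice(hash string) (first, second bool) {
	s := newMemStore()
	defer s.Close()
	u := &ParsedURL{Hash: hash}
	first, _ = s.HasVisited(context.Background(), u)
	second, _ = s.HasVisited(context.Background(), u)
	return first, second
}

func main() {
	fmt.Println(visitTwice("abc123"))
}
```

Combining the check and the insert under one lock avoids the race where two workers both see a URL as unvisited and fetch it twice.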

Directories

Path Synopsis
plugin
