flyscrape

package module
v0.1.0
Published: Aug 29, 2023 License: Apache-2.0 Imports: 22 Imported by: 0

README

flyscrape

flyscrape is an elegant scraping tool for efficiently extracting data from websites. Whether you're a developer, data analyst, or researcher, flyscrape empowers you to effortlessly gather information from web pages and transform it into structured data.

Features

  • Simple and Intuitive: flyscrape offers an easy-to-use command-line interface that allows you to interact with scraping scripts effortlessly.

  • Create New Scripts: The new command enables you to generate sample scraping scripts quickly, providing you with a solid starting point for your scraping endeavors.

  • Run Scripts: Execute your scraping script using the run command, and watch as flyscrape retrieves and processes data from the specified website.

  • Watch for Development: The watch command allows you to watch your scraping script for changes and quickly iterate during development, helping you find the right data extraction queries.

Installation

To install flyscrape, follow these simple steps:

  1. Install Go: Make sure you have Go installed on your system. If not, you can download it from https://golang.org/.

  2. Install flyscrape: Open a terminal and run the following command:

    go install github.com/philippta/flyscrape/cmd/flyscrape@latest
    

Usage

flyscrape offers several commands to assist you in your scraping journey:

Creating a New Script

Use the new command to create a new scraping script:

flyscrape new example.js

Running a Script

Execute your scraping script using the run command:

flyscrape run example.js

Watching for Development

The watch command allows you to watch your scraping script for changes and quickly iterate during development:

flyscrape watch example.js

Example Script

Below is an example scraping script that showcases the capabilities of flyscrape:

import { parse } from 'flyscrape';

export const options = {
    url: 'https://news.ycombinator.com/',     // Specify the URL to start scraping from.
    depth: 1,                                 // Specify how deep links should be followed.  (default = 0, no follow)
    allowedDomains: [],                       // Specify the allowed domains. ['*'] for all. (default = domain from url)
    blockedDomains: [],                       // Specify the blocked domains.                (default = none)
    allowedURLs: [],                          // Specify the allowed URLs as regex.          (default = all allowed)
    blockedURLs: [],                          // Specify the blocked URLs as regex.          (default = none blocked)
    proxy: '',                                // Specify the HTTP(S) proxy to use.           (default = no proxy)
    rate: 100,                                // Specify the rate in requests per second.    (default = 100)
}

export default function({ html, url }) {
    const $ = parse(html);
    const title = $('title');
    const entries = $('.athing').toArray();

    if (!entries.length) {
        return null; // Omits scraped pages without entries.
    }

    return {
        title: title.text(),                                            // Extract the page title.
        entries: entries.map(entry => {                                 // Extract all news entries.
            const link = $(entry).find('.titleline > a');
            const rank = $(entry).find('.rank');
            const points = $(entry).next().find('.score');

            return {
                title: link.text(),                                     // Extract the title text.
                url: link.attr('href'),                                 // Extract the link href.
                rank: parseInt(rank.text().slice(0, -1)),               // Extract and cleanup the rank.
                points: parseInt(points.text().replace(' points', '')), // Extract and cleanup the points.
            }
        }),
    };
}

Contributing

We welcome contributions from the community! If you encounter any issues or have suggestions for improvement, please submit an issue.

Documentation

Index

Constants

This section is empty.

Variables

var StopWatch = errors.New("stop watch")

Functions

func Compile

func Compile(src string) (ScrapeOptions, ScrapeFunc, error)
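
This page does not show how the compiled pieces are used from Go, so here is a minimal sketch: it assumes the package imports as github.com/philippta/flyscrape, that script syntax problems surface as a TransformError, and that example.js is a hypothetical script file on disk.

package main

import (
	"errors"
	"fmt"
	"log"
	"os"

	"github.com/philippta/flyscrape"
)

func main() {
	src, err := os.ReadFile("example.js")
	if err != nil {
		log.Fatal(err)
	}

	// Compile the script into its options and scrape function.
	opts, scrape, err := flyscrape.Compile(string(src))
	if err != nil {
		// Assumption: syntax errors in the script are reported as a TransformError.
		var terr flyscrape.TransformError
		if errors.As(err, &terr) {
			log.Fatalf("script error at %d:%d: %s", terr.Line, terr.Column, terr.Text)
		}
		log.Fatal(err)
	}
	fmt.Println("start URL:", opts.URL)

	// Run the compiled scrape function against a single page.
	data, err := scrape(flyscrape.ScrapeParams{
		HTML: "<html><head><title>Hello</title></head></html>",
		URL:  opts.URL,
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(flyscrape.PrettyPrint(data, ""))
}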

func PrettyPrint

func PrettyPrint(v any, prefix string) string

func Print

func Print(v any, prefix string) string

func Watch

func Watch(path string, fn func(string) error) error
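
A hedged sketch of Watch: it assumes the callback runs on every change to the watched file and that returning StopWatch ends the watch loop cleanly; whether the string argument is the script path or its updated contents is not stated on this page.

package main

import (
	"errors"
	"fmt"
	"log"

	"github.com/philippta/flyscrape"
)

func main() {
	runs := 0

	err := flyscrape.Watch("example.js", func(s string) error {
		// s carries the watched script (assumed: its current contents).
		runs++
		fmt.Printf("change #%d detected (%d bytes)\n", runs, len(s))

		if runs >= 3 {
			// Assumption: returning StopWatch stops watching without an error.
			return flyscrape.StopWatch
		}
		return nil
	})
	if err != nil && !errors.Is(err, flyscrape.StopWatch) {
		log.Fatal(err)
	}
}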

Types

type FetchFunc

type FetchFunc func(url string) (string, error)

func CachedFetch

func CachedFetch(fetch FetchFunc) FetchFunc

func Fetch

func Fetch() FetchFunc

func ProxiedFetch

func ProxiedFetch(proxyURL string) FetchFunc
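
The FetchFunc constructors compose. Below is a sketch under the assumption that CachedFetch wraps any FetchFunc and reuses responses for repeated URLs; the proxy address is hypothetical.

package main

import (
	"fmt"
	"log"

	"github.com/philippta/flyscrape"
)

func main() {
	// Plain HTTP fetcher.
	fetch := flyscrape.Fetch()

	// Alternatively, route requests through a proxy (hypothetical address):
	// fetch = flyscrape.ProxiedFetch("http://localhost:8080")

	// Wrap the fetcher so repeated requests for the same URL hit the cache.
	cached := flyscrape.CachedFetch(fetch)

	html, err := cached("https://news.ycombinator.com/")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(len(html), "bytes fetched")
}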

type ScrapeFunc

type ScrapeFunc func(ScrapeParams) (any, error)

type ScrapeOptions

type ScrapeOptions struct {
	URL            string   `json:"url"`
	AllowedDomains []string `json:"allowedDomains"`
	BlockedDomains []string `json:"blockedDomains"`
	AllowedURLs    []string `json:"allowedURLs"`
	BlockedURLs    []string `json:"blockedURLs"`
	Proxy          string   `json:"proxy"`
	Depth          int      `json:"depth"`
	Rate           float64  `json:"rate"`
}

type ScrapeParams

type ScrapeParams struct {
	HTML string
	URL  string
}

type ScrapeResult

type ScrapeResult struct {
	URL       string    `json:"url"`
	Data      any       `json:"data,omitempty"`
	Links     []string  `json:"-"`
	Error     error     `json:"error,omitempty"`
	Timestamp time.Time `json:"timestamp"`
}

type Scraper

type Scraper struct {
	ScrapeOptions ScrapeOptions
	ScrapeFunc    ScrapeFunc
	FetchFunc     FetchFunc
	// contains filtered or unexported fields
}

func (*Scraper) Scrape

func (s *Scraper) Scrape() <-chan ScrapeResult
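
Putting the exported pieces together: a minimal sketch that compiles a script, wires it into a Scraper, and drains the result channel. It assumes a Scraper literal with only these three exported fields set is ready to use and that the channel is closed once all pages have been visited.

package main

import (
	"fmt"
	"log"
	"os"

	"github.com/philippta/flyscrape"
)

func main() {
	src, err := os.ReadFile("example.js")
	if err != nil {
		log.Fatal(err)
	}

	opts, scrape, err := flyscrape.Compile(string(src))
	if err != nil {
		log.Fatal(err)
	}

	// Wire the compiled script and a cached fetcher into a Scraper.
	s := &flyscrape.Scraper{
		ScrapeOptions: opts,
		ScrapeFunc:    scrape,
		FetchFunc:     flyscrape.CachedFetch(flyscrape.Fetch()),
	}

	// Consume results as pages are scraped.
	for result := range s.Scrape() {
		if result.Error != nil {
			fmt.Println("error:", result.URL, result.Error)
			continue
		}
		fmt.Printf("%s: %+v\n", result.URL, result.Data)
	}
}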

type TransformError

type TransformError struct {
	Line   int
	Column int
	Text   string
}

func (TransformError) Error

func (err TransformError) Error() string

Directories

Path Synopsis
cmd
flyscrape command
