flyscrape

package module
v0.1.0
Published: Aug 29, 2023 License: Apache-2.0 Imports: 22 Imported by: 0

README

flyscrape

flyscrape is an elegant scraping tool for efficiently extracting data from websites. Whether you're a developer, data analyst, or researcher, flyscrape empowers you to effortlessly gather information from web pages and transform it into structured data.

Features

  • Simple and Intuitive: flyscrape offers an easy-to-use command-line interface that allows you to interact with scraping scripts effortlessly.

  • Create New Scripts: The new command enables you to generate sample scraping scripts quickly, providing you with a solid starting point for your scraping endeavors.

  • Run Scripts: Execute your scraping script using the run command, and watch as flyscrape retrieves and processes data from the specified website.

  • Watch for Development: The watch command allows you to watch your scraping script for changes and quickly iterate during development, helping you find the right data extraction queries.

Installation

To install flyscrape, follow these simple steps:

  1. Install Go: Make sure you have Go installed on your system. If not, you can download it from https://golang.org/.

  2. Install flyscrape: Open a terminal and run the following command:

    go install github.com/philippta/flyscrape/cmd/flyscrape@latest
    

Usage

flyscrape offers several commands to assist you in your scraping journey:

Creating a New Script

Use the new command to create a new scraping script:

flyscrape new example.js

Running a Script

Execute your scraping script using the run command:

flyscrape run example.js

Watching for Development

The watch command allows you to watch your scraping script for changes and quickly iterate during development:

flyscrape watch example.js

Example Script

Below is an example scraping script that showcases the capabilities of flyscrape:

import { parse } from 'flyscrape';

export const options = {
    url: 'https://news.ycombinator.com/',     // Specify the URL to start scraping from.
    depth: 1,                                 // Specify how deep links should be followed.  (default = 0, no follow)
    allowedDomains: [],                       // Specify the allowed domains. ['*'] for all. (default = domain from url)
    blockedDomains: [],                       // Specify the blocked domains.                (default = none)
    allowedURLs: [],                          // Specify the allowed URLs as regex.          (default = all allowed)
    blockedURLs: [],                          // Specify the blocked URLs as regex.          (default = none blocked)
    proxy: '',                                // Specify the HTTP(S) proxy to use.           (default = no proxy)
    rate: 100,                                // Specify the rate in requests per second.    (default = 100)
}

export default function({ html, url }) {
    const $ = parse(html);
    const title = $('title');
    const entries = $('.athing').toArray();

    if (!entries.length) {
        return null; // Omits scraped pages without entries.
    }

    return {
        title: title.text(),                                            // Extract the page title.
        entries: entries.map(entry => {                                 // Extract all news entries.
            const link = $(entry).find('.titleline > a');
            const rank = $(entry).find('.rank');
            const points = $(entry).next().find('.score');

            return {
                title: link.text(),                                     // Extract the title text.
                url: link.attr('href'),                                 // Extract the link href.
                rank: parseInt(rank.text().slice(0, -1)),               // Extract and cleanup the rank.
                points: parseInt(points.text().replace(' points', '')), // Extract and cleanup the points.
            }
        }),
    };
}

Contributing

We welcome contributions from the community! If you encounter any issues or have suggestions for improvement, please submit an issue.

Documentation

Index

Constants

This section is empty.

Variables

var StopWatch = errors.New("stop watch")

Functions

func Compile

func Compile(src string) (ScrapeOptions, ScrapeFunc, error)
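
This page does not show how the compiled pieces are used from Go, so here is a minimal sketch: it assumes the package imports as github.com/philippta/flyscrape, that script syntax problems surface as a TransformError, and that example.js is a hypothetical script file on disk.

package main

import (
	"errors"
	"fmt"
	"log"
	"os"

	"github.com/philippta/flyscrape"
)

func main() {
	src, err := os.ReadFile("example.js")
	if err != nil {
		log.Fatal(err)
	}

	// Compile the script into its options and scrape function.
	opts, scrape, err := flyscrape.Compile(string(src))
	if err != nil {
		// Assumption: syntax errors in the script are reported as a TransformError.
		var terr flyscrape.TransformError
		if errors.As(err, &terr) {
			log.Fatalf("script error at %d:%d: %s", terr.Line, terr.Column, terr.Text)
		}
		log.Fatal(err)
	}
	fmt.Println("start URL:", opts.URL)

	// Run the compiled scrape function against a single page.
	data, err := scrape(flyscrape.ScrapeParams{
		HTML: "<html><head><title>Hello</title></head></html>",
		URL:  opts.URL,
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(flyscrape.PrettyPrint(data, ""))
}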

func PrettyPrint

func PrettyPrint(v any, prefix string) string

func Print

func Print(v any, prefix string) string

func Watch

func Watch(path string, fn func(string) error) error
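
A hedged sketch of Watch: it assumes the callback runs on every change to the watched file and that returning StopWatch ends the watch loop cleanly; whether the string argument is the script path or its updated contents is not stated on this page.

package main

import (
	"errors"
	"fmt"
	"log"

	"github.com/philippta/flyscrape"
)

func main() {
	runs := 0

	err := flyscrape.Watch("example.js", func(s string) error {
		// s carries the watched script (assumed: its current contents).
		runs++
		fmt.Printf("change #%d detected (%d bytes)\n", runs, len(s))

		if runs >= 3 {
			// Assumption: returning StopWatch stops watching without an error.
			return flyscrape.StopWatch
		}
		return nil
	})
	if err != nil && !errors.Is(err, flyscrape.StopWatch) {
		log.Fatal(err)
	}
}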

Types

type FetchFunc

type FetchFunc func(url string) (string, error)

func CachedFetch

func CachedFetch(fetch FetchFunc) FetchFunc

func Fetch

func Fetch() FetchFunc

func ProxiedFetch

func ProxiedFetch(proxyURL string) FetchFunc
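
The FetchFunc constructors compose. Below is a sketch under the assumption that CachedFetch wraps any FetchFunc and reuses responses for repeated URLs; the proxy address is hypothetical.

package main

import (
	"fmt"
	"log"

	"github.com/philippta/flyscrape"
)

func main() {
	// Plain HTTP fetcher.
	fetch := flyscrape.Fetch()

	// Alternatively, route requests through a proxy (hypothetical address):
	// fetch = flyscrape.ProxiedFetch("http://localhost:8080")

	// Wrap the fetcher so repeated requests for the same URL hit the cache.
	cached := flyscrape.CachedFetch(fetch)

	html, err := cached("https://news.ycombinator.com/")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(len(html), "bytes fetched")
}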

type ScrapeFunc

type ScrapeFunc func(ScrapeParams) (any, error)

type ScrapeOptions

type ScrapeOptions struct {
	URL            string   `json:"url"`
	AllowedDomains []string `json:"allowedDomains"`
	BlockedDomains []string `json:"blockedDomains"`
	AllowedURLs    []string `json:"allowedURLs"`
	BlockedURLs    []string `json:"blockedURLs"`
	Proxy          string   `json:"proxy"`
	Depth          int      `json:"depth"`
	Rate           float64  `json:"rate"`
}

type ScrapeParams

type ScrapeParams struct {
	HTML string
	URL  string
}

type ScrapeResult

type ScrapeResult struct {
	URL       string    `json:"url"`
	Data      any       `json:"data,omitempty"`
	Links     []string  `json:"-"`
	Error     error     `json:"error,omitempty"`
	Timestamp time.Time `json:"timestamp"`
}

type Scraper

type Scraper struct {
	ScrapeOptions ScrapeOptions
	ScrapeFunc    ScrapeFunc
	FetchFunc     FetchFunc
	// contains filtered or unexported fields
}

func (*Scraper) Scrape

func (s *Scraper) Scrape() <-chan ScrapeResult
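
Putting the exported pieces together: a minimal sketch that compiles a script, wires it into a Scraper, and drains the result channel. It assumes a Scraper literal with only these three exported fields set is ready to use and that the channel is closed once all pages have been visited.

package main

import (
	"fmt"
	"log"
	"os"

	"github.com/philippta/flyscrape"
)

func main() {
	src, err := os.ReadFile("example.js")
	if err != nil {
		log.Fatal(err)
	}

	opts, scrape, err := flyscrape.Compile(string(src))
	if err != nil {
		log.Fatal(err)
	}

	// Wire the compiled script and a cached fetcher into a Scraper.
	s := &flyscrape.Scraper{
		ScrapeOptions: opts,
		ScrapeFunc:    scrape,
		FetchFunc:     flyscrape.CachedFetch(flyscrape.Fetch()),
	}

	// Consume results as pages are scraped.
	for result := range s.Scrape() {
		if result.Error != nil {
			fmt.Println("error:", result.URL, result.Error)
			continue
		}
		fmt.Printf("%s: %+v\n", result.URL, result.Data)
	}
}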

type TransformError

type TransformError struct {
	Line   int
	Column int
	Text   string
}

func (TransformError) Error

func (err TransformError) Error() string

Directories

Path Synopsis
cmd
flyscrape command
