flyscrape is a standalone and scriptable web scraper, combining the speed of Go with the flexibility of JavaScript. Focus on data extraction rather than request juggling.
flyscrape is available for macOS, Linux and Windows as a downloadable binary from the releases page.
Compile from source
To compile flyscrape from source, follow these steps:
Install Go: Make sure you have Go installed on your system. If not, you can download it from https://go.dev/.
Install flyscrape: Open a terminal and run the following command:
go install github.com/philippta/flyscrape/cmd/flyscrape@latest
Usage
Usage:
flyscrape run SCRIPT [config flags]
Examples:
# Run the script.
$ flyscrape run example.js
# Set the URL as argument.
$ flyscrape run example.js --url "http://other.com"
# Enable proxy support.
$ flyscrape run example.js --proxy "http://someproxy:8043"
# Follow paginated links.
$ flyscrape run example.js --depth 5 --follow ".next-button > a"
Configuration
Below is an example scraping script that showcases the capabilities of flyscrape. For full documentation of all configuration options, visit the documentation page.
export const config = {
// Specify the URL to start scraping from.
url: "https://example.com/",
// Specify multiple URLs to start scraping from. (default = [])
urls: [
"https://anothersite.com/",
"https://yetanother.com/",
],
// Specify how deep links should be followed. (default = 0, no follow)
depth: 5,
// Specify the CSS selectors to follow. (default = ["a[href]"])
follow: [".next > a", ".related a"],
// Specify the allowed domains. ['*'] for all. (default = domain from url)
allowedDomains: ["example.com", "anothersite.com"],
// Specify the blocked domains. (default = none)
blockedDomains: ["somesite.com"],
// Specify the allowed URLs as regex. (default = all allowed)
allowedURLs: ["/posts", "/articles/\\d+"],
// Specify the blocked URLs as regex. (default = none)
blockedURLs: ["/admin"],
// Specify the rate in requests per minute. (default = no rate limit)
rate: 60,
// Specify the number of concurrent requests. (default = no limit)
concurrency: 1,
// Specify a single HTTP(S) proxy URL. (default = no proxy)
proxy: "http://someproxy.com:8043",
// Specify multiple HTTP(S) proxy URLs. (default = no proxy)
proxies: [
"http://someproxy.com:8043",
"http://someotherproxy.com:8043",
],
// Enable file-based request caching. (default = no cache)
cache: "file",
// Specify the HTTP request headers. (default = none)
headers: {
"Authorization": "Bearer ...",
"User-Agent": "Mozilla ...",
},
};
export function setup() {
// Optional setup function, called once before scraping starts.
// Can be used for authentication.
}
export default function ({ doc, url, absoluteURL }) {
// doc - Contains the parsed HTML document
// url - Contains the scraped URL
// absoluteURL(...) - Transforms relative URLs into absolute URLs
}
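The `absoluteURL` helper listed above turns relative links into absolute ones. As a rough illustration of that idea outside flyscrape, the sketch below resolves a link against the page it was found on, using the standard `URL` class available in Node.js and browsers. The name `toAbsoluteURL` and the sample URLs are hypothetical, and flyscrape's own helper may behave differently in edge cases.

```javascript
// Sketch only: approximates relative-to-absolute URL resolution with the
// standard URL class. toAbsoluteURL is a hypothetical name, not flyscrape API.
function toAbsoluteURL(pageURL, href) {
  // new URL(input, base) resolves input against base per the WHATWG URL spec.
  return new URL(href, pageURL).href;
}

// Hypothetical page that the links were scraped from.
const page = "https://example.com/posts/index.html";

console.log(toAbsoluteURL(page, "/articles/42"));
// https://example.com/articles/42
console.log(toAbsoluteURL(page, "next.html"));
// https://example.com/posts/next.html
console.log(toAbsoluteURL(page, "https://anothersite.com/"));
// https://anothersite.com/
```

Root-relative links replace the whole path, plain relative links replace only the last path segment, and links that are already absolute pass through unchanged.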
import { download } from "flyscrape/http";
download("http://example.com/image.jpg") // downloads as "image.jpg"
download("http://example.com/image.jpg", "other.jpg") // downloads as "other.jpg"
download("http://example.com/image.jpg", "dir/") // downloads as "dir/image.jpg"
// If the server offers a filename via the Content-Disposition header and no
// destination filename is provided, flyscrape will honor the suggested filename.
// E.g. `Content-Disposition: attachment; filename="archive.zip"`
download("http://example.com/generate_archive.php", "dir/") // downloads as "dir/archive.zip"
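The naming rules in the comments above can be restated as a small standalone function. This is a sketch of the described behavior under simplifying assumptions, not flyscrape's implementation: `resolveDownloadPath` is a hypothetical helper, it treats any destination ending in `/` as a directory, and it only understands the simple `filename="..."` form of the `Content-Disposition` header.

```javascript
// Sketch only: mirrors the download naming rules described above.
// resolveDownloadPath is a hypothetical helper, not part of flyscrape.
function resolveDownloadPath(url, dest = "", contentDisposition = "") {
  // An explicit destination that is not a directory is used as-is.
  if (dest !== "" && !dest.endsWith("/")) {
    return dest;
  }
  // Look for a filename suggested by the server (simple quoted or bare form).
  const match = /filename="?([^";]+)"?/i.exec(contentDisposition);
  // Fallback: the last path segment of the URL.
  const fromURL = new URL(url).pathname.split("/").pop();
  const name = match ? match[1] : fromURL;
  // dest is either empty or a directory ending in "/".
  return dest + name;
}

console.log(resolveDownloadPath("http://example.com/image.jpg"));
// image.jpg
console.log(resolveDownloadPath("http://example.com/image.jpg", "other.jpg"));
// other.jpg
console.log(resolveDownloadPath("http://example.com/image.jpg", "dir/"));
// dir/image.jpg
console.log(resolveDownloadPath(
  "http://example.com/generate_archive.php",
  "dir/",
  'attachment; filename="archive.zip"'
));
// dir/archive.zip
```

The server-suggested filename is consulted only when no destination filename is given, which matches the precedence described in the comments above.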
Issues and Suggestions
If you encounter any issues or have suggestions for improvement, please submit an issue.