grab

v0.1.3 (latest) | Published: Aug 20, 2022 | License: MIT | Imports: 4 | Imported by: 0

Repository

github.com/everdrone/grab

Greedy, Regex-Aware Binary Downloader

Installation

Download and install the latest release

Usage

Let's start fresh. Run the following command to generate a new configuration file in the current directory:

grab config generate

The file grab.hcl should be located in your home directory, or in any parent directory of the directory you run the command from.

The file is written in HashiCorp Configuration Language (HCL).

Read more about the configuration options here

Once you're happy with your configuration, you can check that everything is OK by running:

grab config check

Or, if your file is not located in a parent directory of your current working directory, you can specify its path with the --config option:

grab config check -c /var/grab.hcl

Now you can start using grab. To scrape and download assets, use the grab get command and pass at least one URL or a file containing a list of URLs.

Note
The list of URLs can contain comments, as in the ini format: all lines starting with # or ; are ignored.

# single URL
grab get https://url.to/scrape/files?from

# list of URLs
grab get urls.ini

# at least one of each
grab get https://my.url/and urls.ini list.ini
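
The comment rule for URL lists can be illustrated with a short Go sketch. It is not grab's actual parser: the function name parseURLList is hypothetical, and trimming whitespace and skipping blank lines are assumptions that the note above does not state.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseURLList returns the URLs found in an ini-style list.
// Lines starting with '#' or ';' are comments and are skipped.
// Assumption: blank lines are skipped and surrounding whitespace is trimmed.
func parseURLList(content string) []string {
	var urls []string
	scanner := bufio.NewScanner(strings.NewReader(content))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "#") || strings.HasPrefix(line, ";") {
			continue
		}
		urls = append(urls, line)
	}
	return urls
}

func main() {
	list := "# galleries\nhttps://example.com/gallery/1\n; skipped\nhttps://example.com/gallery/2\n"
	for _, u := range parseURLList(list) {
		fmt.Println(u)
	}
}
```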

Configuration

Take this example configuration:

global {
    location = "/home/user/Downloads/grab"
}

site "example" {
    test = "example\\.com"

    asset "image" {
        pattern  = "<img src=\"([^\"]+)\""
        find_all = true
        capture  = 1
    }

    info "editor" {
        pattern = "editor:\\s@(\\w+)"
        capture = 1
    }

    subdirectory {
        pattern = "gallery\\/(?P<id>\\d+)\\/"
        capture = "id"
        from    = url
    }
}

Let's pass our hypothetical URL to grab get:

grab get https://example.com/gallery/1337/overview

Downloading assets

The program checks whether the URL matches the test pattern of any site block. If a pattern matches, the program fetches the page body and scrapes its contents.

asset "video" {
    pattern  = "<video src=\"(?P<videourl>[^\"]+)\""
    capture  = "videourl"
    find_all = true  # optional
}

Note
To escape double quotes, use one backslash: \"
To use regex escape sequences like \d, double the backslash: \\d
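
Go's quoted string literals use the same two escapes, so the effect of the note can be shown with a small Go sketch. It illustrates the escaping only and does not parse HCL; the helper firstGroup is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// firstGroup compiles pattern and returns capture group 1 of the first
// match in s, or an empty string when nothing matches.
func firstGroup(pattern, s string) string {
	m := regexp.MustCompile(pattern).FindStringSubmatch(s)
	if len(m) < 2 {
		return ""
	}
	return m[1]
}

func main() {
	// \" in the quoted literal is a plain double quote in the resulting pattern.
	quoted := "<video src=\"([^\"]+)\""
	fmt.Println(quoted == `<video src="([^"]+)"`)

	// \\d in the quoted literal reaches the regex engine as \d.
	fmt.Println(firstGroup("tel:(\\d+)", "tel:5551234"))
}
```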

For each asset block, grab searches for matches using the pattern regex and then extracts the capture group from the matches. By default, only the first match is extracted. To extract multiple URLs from the same page, set find_all to true. Finally, all the files are downloaded from the extracted URLs.

Example of extracted URLs with the configuration above:
https://cdn.example.com/img/image1.jpg
https://cdn.example.com/img/image2.jpg
https://cdn.example.com/img/image3.jpg

Indexing data

After the assets, all the info blocks are evaluated. The information they extract from the page is stored in an _info.json file.

info "phone" {
    pattern  = "tel:(\\d+)"
    capture  = 1
}

Inside the info file, two additional properties are set by default: url, the URL of the page the information was scraped from, and timestamp, the current time.

Example _info.json output:

{
  "url": "https://example.com/gallery/1337/overview",
  "timestamp": "2022-08-17T13:51:58.7265822Z",
  "editor": "everdrone"
}

By default, grab creates a subdirectory with the site name (in this case example) to store the information downloaded from this site. If you want to create separate subdirectories under example you can specify a subdirectory block.

Subdirectories

The subdirectory block extracts a string using pattern and capture, just like other blocks. In addition, the from attribute tells grab whether to search inside the URL or inside the page body.

subdirectory {
    pattern = "href=\"\\/\\@(?P<user>[^\"]+)"
    capture = "user"
    from    = body  # defaults to url
}

The final path of the assets will be <global.location>/<site.name>/<subdirectory>/<filename>

Example of destinations from the configuration above:
/home/user/Downloads/grab/example/1337/image1.jpg
Similarly, the _info.json file will be saved to /home/user/Downloads/grab/example/1337/_info.json

If no subdirectory block is specified, the asset destination will conform to: <global.location>/<site.name>/<filename>
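
The two path layouts can be sketched in Go as below. The destination function is hypothetical, and the standard path package is used here only to keep forward slashes on every platform; a real tool would more likely use path/filepath.

```go
package main

import (
	"fmt"
	"path"
)

// destination joins the configured location, the site name, the optional
// subdirectory and the file name. path.Join drops empty elements, so an
// empty subdirectory yields <location>/<site>/<filename>.
func destination(location, siteName, subdirectory, filename string) string {
	return path.Join(location, siteName, subdirectory, filename)
}

func main() {
	fmt.Println(destination("/home/user/Downloads/grab", "example", "1337", "image1.jpg"))
	fmt.Println(destination("/home/user/Downloads/grab", "example", "", "image1.jpg"))
}
```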

Warning
If the pattern attribute contains named groups, you must set the capture attribute to the name of the group you want to capture.

Use an integer capture group only if your pattern does not contain named groups.
To learn more about Go's regexp syntax see the official documentation.

Network options

If a site requires specific headers to be set, or a number of retries, you can add optional network blocks to your configuration file.

network {
    # all attributes are optional
    retries = 3
    timeout = 10000  # in milliseconds
    headers = {
        "User-Agent" = "Mozilla/5.0 ..."
    }
}

network blocks can be placed in the global block, inside site blocks, and even inside asset blocks.

By default, the global.network configuration is inherited by all sites and all site assets. To avoid inheriting the network configuration of a parent block, you can set inherit = false like so:

site "example" {
    # ...

    network {
        inherit = false
    }

    # ...
}
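
One way to picture the inheritance rule is the following Go sketch. It is only a model under stated assumptions: the network struct and resolve function are hypothetical, and the choices that a child overrides individual attributes and that headers are merged key by key are guesses, not documented behavior.

```go
package main

import "fmt"

// network is a hypothetical, simplified model of a network block.
// Pointer fields distinguish "not set" from a zero value.
type network struct {
	inherit *bool
	retries *int
	timeout *int // milliseconds
	headers map[string]string
}

// resolve returns the effective settings of child given its parent.
// With inherit = false the parent is ignored entirely.
// Assumption: otherwise the child overrides the parent attribute by attribute.
func resolve(parent, child network) network {
	if child.inherit != nil && !*child.inherit {
		return child
	}
	out := network{retries: parent.retries, timeout: parent.timeout, headers: map[string]string{}}
	for k, v := range parent.headers {
		out.headers[k] = v
	}
	if child.retries != nil {
		out.retries = child.retries
	}
	if child.timeout != nil {
		out.timeout = child.timeout
	}
	for k, v := range child.headers {
		out.headers[k] = v
	}
	return out
}

func intPtr(i int) *int    { return &i }
func boolPtr(b bool) *bool { return &b }

func main() {
	global := network{
		retries: intPtr(3),
		timeout: intPtr(10000),
		headers: map[string]string{"User-Agent": "Mozilla/5.0 ..."},
	}

	// A site that only overrides retries keeps the global timeout and headers.
	eff := resolve(global, network{retries: intPtr(1)})
	fmt.Println(*eff.retries, *eff.timeout, eff.headers["User-Agent"])

	// A site with inherit = false starts from an empty configuration.
	isolated := resolve(global, network{inherit: boolPtr(false)})
	fmt.Println(isolated.retries == nil, len(isolated.headers))
}
```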

To learn more about advanced configuration patterns, see Advanced Configuration

Command Options

`get`

Arguments

Accepts URLs, paths to lists of URLs, or both at the same time.

# grab get <url|file> [url|file...] [options]

grab get https://example.com/gallery/1 \
         https://example.com/gallery/2 \
         path/to/list.ini \
         other/file.ini -n

Options

| Long | Short | Default | Description |
|---|---|---|---|
| `force` | `f` | `false` | Overwrites already existing files |
| `config` | `c` | `nil` | Specify the path to a configuration file |
| `strict` | `s` | `false` | Will stop the program at the first encountered error |
| `dry-run` | `n` | `false` | Will send requests without writing to the disk |
| `progress` | `p` | `false` | Show a progress bar |
| `quiet` | `q` | `false` | Suppress all output to `stdout` (errors will still be printed to `stderr`). This option takes precedence over `verbose`. |
| `verbose` | `v` | `1` | Set the verbosity level. `-v` is 1, `-vv` is 2 and so on. `quiet` overrides this option. |