grab

command module
v0.1.3
Published: Aug 20, 2022 License: MIT Imports: 4 Imported by: 0

GRAB

Greedy, Regex-Aware Binary Downloader

Installation

Download and install the latest release

Usage

Let's start fresh. Run the following command to generate a new configuration file in the current directory:

grab config generate

The grab.hcl file should be located in your home directory, or in any parent directory of the one you run the command from.

The file is written in the HashiCorp Configuration Language (HCL).

Read more about the configuration options here

Once you're happy with your configuration, you can check that everything is OK by running:

grab config check

or, if your file is not located in a parent directory of your current working directory, you can always specify its path with the --config (-c) option:

grab config check -c /var/grab.hcl

Now you can start using grab. To scrape and download assets, use the grab get command and pass at least one URL or a file containing a list of URLs.

Note
The list of URLs can contain comments, as in the ini format: all lines starting with # or ; are ignored.

# single URL
grab get https://url.to/scrape/files?from

# list of URLs
grab get urls.ini

# at least one of each
grab get https://my.url/and urls.ini list.ini
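
The comment rule for URL lists can be illustrated with a short Go sketch. This is not grab's own code: the function name parseURLList is hypothetical, and trimming surrounding whitespace and skipping blank lines are assumptions that go beyond what the note above states.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseURLList returns the lines of a URL list that are neither blank nor comments.
// Assumption: whitespace around each line is trimmed before the check, so a line
// counts as a comment when its first non-space character is '#' or ';'.
func parseURLList(content string) []string {
	var urls []string
	scanner := bufio.NewScanner(strings.NewReader(content))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "#") || strings.HasPrefix(line, ";") {
			continue // skip blank lines and comments
		}
		urls = append(urls, line)
	}
	return urls
}

func main() {
	list := "# galleries\nhttps://example.com/gallery/1\n; skipped\n\nhttps://example.com/gallery/2\n"
	for _, u := range parseURLList(list) {
		fmt.Println(u)
	}
}
```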

Configuration

Take this example configuration:

global {
    location = "/home/user/Downloads/grab"
}

site "example" {
    test = "example\\.com"

    asset "image" {
        pattern  = "<img src=\"([^\"]+)\""
        find_all = true
        capture  = 1
    }

    info "editor" {
        pattern = "editor:\\s@(\\w+)"
        capture = 1
    }

    subdirectory {
        pattern = "gallery\\/(?P<id>\\d+)\\/"
        capture = "id"
        from    = url
    }
}

Let's pass our hypothetical URL to grab get:

grab get https://example.com/gallery/1337/overview

Downloading assets

The program checks whether the URL matches the test pattern of any site block. If a pattern matches, the program fetches the page body to scrape its contents.
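
As a rough illustration of the matching step, the Go sketch below tests a URL against a list of site patterns. The site type and the newSite and matchSite helpers are hypothetical, and returning the first matching site is an assumption; the text above only says the URL is checked against the site blocks.

```go
package main

import (
	"fmt"
	"regexp"
)

// site is a hypothetical stand-in for a site block: a name plus its compiled test pattern.
type site struct {
	name string
	test *regexp.Regexp
}

// newSite compiles the test pattern; it panics on an invalid pattern to keep the sketch short.
func newSite(name, pattern string) site {
	return site{name: name, test: regexp.MustCompile(pattern)}
}

// matchSite returns the first site whose test pattern matches the URL.
// Assumption: the first match wins.
func matchSite(sites []site, url string) (site, bool) {
	for _, s := range sites {
		if s.test.MatchString(url) {
			return s, true
		}
	}
	return site{}, false
}

func main() {
	sites := []site{newSite("example", `example\.com`)}
	if s, ok := matchSite(sites, "https://example.com/gallery/1337/overview"); ok {
		fmt.Println("matched site:", s.name)
	}
}
```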

asset "video" {
    pattern  = "<video src=\"(?P<videourl>[^\"]+)\""
    capture  = "videourl"
    find_all = true  # optional
}

Note
To escape double quotes, you must use one backslash: \"
To escape common regex expressions like \d you should escape twice: \\d

For each asset block, grab searches the page body for matches of the pattern regex and extracts the capture group from each match. By default, only the first match is extracted; to extract multiple URLs from the same page, set find_all to true. Finally, every file is downloaded from the extracted URLs.

Example of extracted URLs with the configuration above:
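
The difference between the default behaviour and find_all can be sketched with Go's regexp package. This is a minimal sketch, not grab's implementation; the extract function and its parameters are hypothetical names that mirror the pattern, capture and find_all attributes.

```go
package main

import (
	"fmt"
	"regexp"
)

// extract returns the text of a numbered capture group for the first match,
// or for every match when findAll is true.
func extract(body, pattern string, capture int, findAll bool) ([]string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	if capture < 0 || capture > re.NumSubexp() {
		return nil, fmt.Errorf("capture group %d does not exist in the pattern", capture)
	}
	limit := 1 // default: first match only
	if findAll {
		limit = -1 // a negative limit means "all matches"
	}
	var out []string
	for _, m := range re.FindAllStringSubmatch(body, limit) {
		out = append(out, m[capture])
	}
	return out, nil
}

func main() {
	body := `<img src="https://cdn.example.com/img/image1.jpg"><img src="https://cdn.example.com/img/image2.jpg">`
	urls, err := extract(body, `<img src="([^"]+)"`, 1, true)
	if err != nil {
		panic(err)
	}
	for _, u := range urls {
		fmt.Println(u)
	}
}
```

Note that the Go raw string in this sketch needs no doubled backslashes; the escaping rules in the note above apply to HCL strings in grab.hcl.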

https://cdn.example.com/img/image1.jpg
https://cdn.example.com/img/image2.jpg
https://cdn.example.com/img/image3.jpg

Indexing data

After the assets are processed, all info blocks are evaluated. The information extracted from the page is stored in an _info.json file.

info "phone" {
    pattern  = "tel:(\\d+)"
    capture  = 1
}

Inside the info file, two additional properties will be set by default: url and timestamp, representing the page url where the information has been scraped from, and the current time.

Example _info.json output:

{
  "url": "https://example.com/gallery/1337/overview",
  "timestamp": "2022-08-17T13:51:58.7265822Z",
  "editor": "everdrone"
}
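
A file of this shape could be assembled roughly as follows. This is a hedged sketch rather than grab's code: buildInfo is a hypothetical helper, and because Go sorts map keys when marshalling, the key order differs from the example output above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// buildInfo merges the extracted info values with the two default properties,
// url and timestamp, and renders them as indented JSON.
// Assumption: an info block named "url" or "timestamp" would overwrite the default.
func buildInfo(pageURL, timestamp string, extracted map[string]string) ([]byte, error) {
	info := map[string]string{
		"url":       pageURL,
		"timestamp": timestamp,
	}
	for name, value := range extracted {
		info[name] = value
	}
	return json.MarshalIndent(info, "", "  ")
}

func main() {
	now := time.Now().UTC().Format(time.RFC3339Nano)
	out, err := buildInfo(
		"https://example.com/gallery/1337/overview",
		now,
		map[string]string{"editor": "everdrone"},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```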

By default, grab creates a subdirectory with the site name (in this case example) to store the information downloaded from this site. If you want to create separate subdirectories under example you can specify a subdirectory block.

Subdirectories

The subdirectory block extracts a string using pattern and capture, just like the other blocks, but you can also set the from attribute to tell grab whether to search inside the url or inside the body:

subdirectory {
    pattern = "href=\"\\/\\@(?P<user>[^\"]+)"
    capture = "user"
    from    = body  # defaults to url
}

The final path of the assets will be <global.location>/<site.name>/<subdirectory>/<filename>

Example of destinations from the configuration above:

/home/user/Downloads/grab/example/1337/image1.jpg

Similarly, the _info.json file will be saved to /home/user/Downloads/grab/example/1337/_info.json

If no subdirectory block is specified, the asset destination will conform to: <global.location>/<site.name>/<filename>
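
The two path layouts can be reproduced with a few lines of Go. This is an illustrative sketch only: the destination function is hypothetical, and deriving the file name from the last segment of the asset URL is an assumption (a URL with a query string would need extra handling).

```go
package main

import (
	"fmt"
	"path"
)

// destination builds <location>/<site>/<subdirectory>/<filename>.
// path.Join ignores empty elements, so an empty subdirectory yields
// <location>/<site>/<filename>.
// Assumption: the file name is the last path segment of the asset URL.
func destination(location, siteName, subdirectory, assetURL string) string {
	filename := path.Base(assetURL)
	return path.Join(location, siteName, subdirectory, filename)
}

func main() {
	asset := "https://cdn.example.com/img/image1.jpg"
	fmt.Println(destination("/home/user/Downloads/grab", "example", "1337", asset))
	fmt.Println(destination("/home/user/Downloads/grab", "example", "", asset))
}
```

The sketch uses the path package, which always joins with forward slashes; platform-specific separators would call for path/filepath instead.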

Warning
If the pattern attribute contains named groups, you must set the capture attribute to the name of the group you want to capture.

Use an integer capture group only if your pattern does not contain named groups.
To learn more about Go's regexp syntax see the official documentation.
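
For reference, this is how a named group can be resolved with Go's regexp package. It is a minimal sketch, not grab's code: captureNamed is a hypothetical helper, while Regexp.SubexpIndex and the (?P<name>...) syntax are part of the standard library.

```go
package main

import (
	"fmt"
	"regexp"
)

// captureNamed returns the text matched by the named group in the first match.
// The boolean is false when the pattern is invalid, the group does not exist,
// or the input does not match.
func captureNamed(input, pattern, name string) (string, bool) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return "", false
	}
	idx := re.SubexpIndex(name) // -1 when no group has this name
	match := re.FindStringSubmatch(input)
	if idx < 0 || match == nil {
		return "", false
	}
	return match[idx], true
}

func main() {
	id, ok := captureNamed("https://example.com/gallery/1337/overview", `gallery/(?P<id>\d+)/`, "id")
	fmt.Println(id, ok)
}
```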

Network options

If a site requires specific headers to be set, or a number of retries, you can add optional network blocks to your configuration file.

network {
    # all attributes are optional
    retries = 3
    timeout = 10000  # in milliseconds
    headers = {
        "User-Agent" = "Mozilla/5.0 ..."
    }
}

network blocks can be located in the global block, inside site blocks and even asset blocks.

By default, the global.network configuration is inherited by all sites and all site assets. To avoid inheriting the network configuration of a parent block, set inherit = false like so:

site "example" {
    # ...

    network {
        inherit = false
    }

    # ...
}
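
One way to picture the inheritance rule is the Go sketch below. It is not taken from grab: the network struct and the resolve function are hypothetical, and the field-by-field merge (child values override the parent's, headers are merged key by key) is an assumption. The text above only states that options are inherited by default and that inherit = false disables this.

```go
package main

import "fmt"

// network is a hypothetical in-memory form of a network block.
// Zero values stand for "not set" in this sketch.
type network struct {
	noInherit bool // true when the block sets inherit = false
	retries   int
	timeout   int // milliseconds
	headers   map[string]string
}

// resolve returns the effective options of a child block given the
// already-resolved options of its parent.
func resolve(parent, child network) network {
	if child.noInherit {
		return child // inherit = false: the parent is ignored entirely
	}
	out := network{retries: parent.retries, timeout: parent.timeout, headers: map[string]string{}}
	for k, v := range parent.headers {
		out.headers[k] = v
	}
	// Assumption: values set on the child override inherited ones.
	if child.retries != 0 {
		out.retries = child.retries
	}
	if child.timeout != 0 {
		out.timeout = child.timeout
	}
	for k, v := range child.headers {
		out.headers[k] = v
	}
	return out
}

func main() {
	global := network{retries: 3, timeout: 10000, headers: map[string]string{"User-Agent": "Mozilla/5.0 ..."}}
	fmt.Printf("%+v\n", resolve(global, network{timeout: 5000}))
	fmt.Printf("%+v\n", resolve(global, network{noInherit: true}))
}
```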

To learn more about advanced configuration patterns, see Advanced Configuration.

Command Options

get
Arguments

Accepts URLs, paths to files containing lists of URLs, or both at the same time.

# grab get <url|file> [url|file...] [options]

grab get https://example.com/gallery/1 \
         https://example.com/gallery/2 \
         path/to/list.ini \
         other/file.ini -n

Options
Long      Short  Default  Description
force     f      false    Overwrites already existing files
config    c      nil      Specify the path to a configuration file
strict    s      false    Will stop the program at the first encountered error
dry-run   n      false    Will send requests without writing to the disk
progress  p      false    Show a progress bar
quiet     q      false    Suppress all output to stdout (errors will still be printed to stderr). This option takes precedence over verbose.
verbose   v      1        Set the verbosity level. -v is 1, -vv is 2 and so on... quiet overrides this option.

Next steps

  • Retries & Timeout
  • Network options inheritance
  • URL manipulation
  • Destination manipulation
  • Display progress bar
  • Better logging
  • Add HCL eval context functions
  • Distribute via various package managers:
    • Homebrew
    • Apt
    • Chocolatey
    • Scoop
  • Scripting language integration
  • Plugins?
  • Sequential jobs (like GitHub workflows)

License

Distributed under the MIT License.
