Greedy, Regex-Aware Binary Downloader
Table of contents
Installation
Download and install the latest release
Usage
Let's start fresh. Run the following command to generate a new configuration file in the current directory:
grab config generate
The file grab.hcl must be located in your home directory, or in any parent directory of the one from which you will call the command.
The file is written in the HashiCorp Configuration Language (HCL).
Read more about the configuration options here
Once you're happy with your configuration, you can check that everything is OK by running:
grab config check
Alternatively, if your file is not located in a parent directory of your current working directory, you can always specify its path with the --config option:
grab config check -c /var/grab.hcl
Now you can start using grab.
To scrape and download assets, use the grab command and pass at least one URL or a file containing a list of URLs.
Note
The list of URLs can contain comments, as in the INI format: all lines starting with # or ; are ignored.
# single URL
grab get https://url.to/scrape/files?from
# list of URLs
grab get urls.ini
# at least one of each
grab get https://my.url/and urls.ini list.ini
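The comment handling described in the note above can be sketched in Go. This is an illustrative sketch of the documented behavior, not grab's actual implementation; the function name is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// skipComments filters an ini-style URL list: blank lines and lines
// starting with '#' or ';' are dropped, mirroring the note above.
// (Hypothetical helper; not part of grab's API.)
func skipComments(list string) []string {
	var urls []string
	for _, line := range strings.Split(list, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") || strings.HasPrefix(line, ";") {
			continue
		}
		urls = append(urls, line)
	}
	return urls
}

func main() {
	list := "# gallery pages\nhttps://example.com/gallery/1\n; disabled for now\nhttps://example.com/gallery/2\n"
	fmt.Println(skipComments(list))
}
```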
Configuration
Take this example configuration:
global {
location = "/home/user/Downloads/grab"
}
site "example" {
test = "example\\.com"
asset "image" {
pattern = "<img src=\"([^\"]+)\""
find_all = true
capture = 1
}
info "editor" {
pattern = "editor:\\s@(\\w+)"
capture = 1
}
subdirectory {
pattern = "gallery\\/(?P<id>\\d+)\\/"
capture = "id"
from = url
}
}
Let's pass our hypothetical URL to grab get:
grab get https://example.com/gallery/1337/overview
Downloading assets
The program will check whether our URL matches any site block using the test pattern. If the pattern matches, the program fetches the page body to scrape its contents.
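The matching step can be illustrated with Go's regexp package (the syntax grab's patterns use). A sketch, assuming the test pattern from the example configuration; it is not grab's actual code.

```go
package main

import (
	"fmt"
	"regexp"
)

// sitePattern is the "test" regex from the example site block:
// test = "example\\.com" in HCL becomes the regex example\.com.
var sitePattern = regexp.MustCompile(`example\.com`)

// matchesSite reports whether a URL would be handled by the site block.
func matchesSite(url string) bool {
	return sitePattern.MatchString(url)
}

func main() {
	fmt.Println(matchesSite("https://example.com/gallery/1337/overview")) // true
	fmt.Println(matchesSite("https://other.org/page"))                    // false
}
```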
asset "video" {
pattern = "<video src=\"(?P<videourl>[^\"]+)\""
capture = "videourl"
find_all = true # optional
}
Note
To escape double quotes, use a single backslash: \"
To escape regex metacharacters such as \d, escape twice: \\d
For each asset block, grab will search for matches using the pattern regex and then extract the capture group from the matches.
By default, only the first match is extracted; if you wish to extract multiple URLs from the same page, set find_all to true. Finally, all the files are downloaded from the extracted URLs.
Example of URLs extracted with the configuration above:
https://cdn.example.com/img/image1.jpg
https://cdn.example.com/img/image2.jpg
https://cdn.example.com/img/image3.jpg
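The extraction step above can be sketched with Go's regexp package, using the "image" asset pattern from the example configuration. This is an illustration of the documented behavior, not grab's actual implementation.

```go
package main

import (
	"fmt"
	"regexp"
)

// imagePattern mirrors the "image" asset block; capture group 1
// (capture = 1) holds the asset URL.
var imagePattern = regexp.MustCompile(`<img src="([^"]+)"`)

// extractAssets returns every captured URL, as find_all = true would.
// With find_all = false, only the first element would be kept.
func extractAssets(body string) []string {
	var urls []string
	for _, m := range imagePattern.FindAllStringSubmatch(body, -1) {
		urls = append(urls, m[1])
	}
	return urls
}

func main() {
	body := `<img src="https://cdn.example.com/img/image1.jpg">` +
		`<img src="https://cdn.example.com/img/image2.jpg">`
	fmt.Println(extractAssets(body))
}
```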
Indexing data
After the assets, all the info blocks are evaluated; the information extracted from the page is stored in an _info.json file.
info "phone" {
pattern = "tel:(\\d+)"
capture = 1
}
Inside the info file, two additional properties are set by default: url and timestamp, representing the URL of the page the information was scraped from and the current time.
Example _info.json output:
{
"url": "https://example.com/gallery/1337/overview",
"timestamp": "2022-08-17T13:51:58.7265822Z",
"editor": "everdrone"
}
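The assembly of the info file can be sketched as follows, using the "editor" info pattern from the example configuration. A sketch of the documented behavior, not grab's actual code; the function name is hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"time"
)

// editorPattern is the "editor" info block from the example
// configuration: editor:\s@(\w+) with capture = 1.
var editorPattern = regexp.MustCompile(`editor:\s@(\w+)`)

// buildInfo assembles the _info.json contents: url and timestamp
// are always present, and each info block adds a key named after
// the block. (Hypothetical helper for illustration.)
func buildInfo(url, body string) map[string]string {
	info := map[string]string{
		"url":       url,
		"timestamp": time.Now().UTC().Format(time.RFC3339Nano),
	}
	if m := editorPattern.FindStringSubmatch(body); m != nil {
		info["editor"] = m[1] // capture group 1
	}
	return info
}

func main() {
	info := buildInfo("https://example.com/gallery/1337/overview", "editor: @everdrone")
	out, _ := json.Marshal(info)
	fmt.Println(string(out))
}
```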
By default, grab creates a subdirectory with the site name (in this case example) to store the information downloaded from this site.
If you want to create separate subdirectories under example, you can specify a subdirectory block.
Subdirectories
The subdirectory block extracts a string using pattern and capture just like the other blocks, but you can set the from attribute to tell grab whether to search inside the URL or inside the page body:
subdirectory {
pattern = "href=\"\\/\\@(?P<user>[^\"]+)"
capture = "user"
from = body # defaults to url
}
The final path of the assets will be <global.location>/<site.name>/<subdirectory>/<filename>
Example of destinations from the configuration above:
/home/user/Downloads/grab/example/1337/image1.jpg
Similarly, the _info.json file will be saved to /home/user/Downloads/grab/example/1337/_info.json
If no subdirectory block is specified, the asset destination will conform to: <global.location>/<site.name>/<filename>
Warning
If the pattern attribute contains named groups, you must set the capture attribute to the name of the group you want to extract.
Use an integer capture index only if your pattern does not contain named groups.
To learn more about Go's regexp syntax, see the official documentation.
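The named-capture lookup described in the warning can be sketched with Go's regexp package, using the subdirectory pattern from the example configuration. An illustration only, not grab's implementation; the helper name is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// subdirPattern is the subdirectory pattern from the example
// configuration, with a named group: capture = "id" selects it.
var subdirPattern = regexp.MustCompile(`gallery/(?P<id>\d+)/`)

// namedCapture extracts a capture group by name, as a string-valued
// capture attribute does. (Hypothetical helper for illustration.)
func namedCapture(pattern *regexp.Regexp, name, s string) string {
	m := pattern.FindStringSubmatch(s)
	if m == nil {
		return ""
	}
	// SubexpIndex maps the group name back to its integer index.
	return m[pattern.SubexpIndex(name)]
}

func main() {
	fmt.Println(namedCapture(subdirPattern, "id", "https://example.com/gallery/1337/overview")) // 1337
}
```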
Network options
If a site requires specific headers to be set, or a number of retries, you can add optional network blocks to your configuration file.
network {
# all attributes are optional
retries = 3
timeout = 10000 # in milliseconds
headers = {
"User-Agent" = "Mozilla/5.0 ..."
}
}
network blocks can be located in the global block, inside site blocks, and even inside asset blocks.
By default, the global.network configuration is inherited by all sites and all site assets. To avoid inheriting the network configuration of a parent block, you can set inherit = false like so:
site "example" {
# ...
network {
inherit = false
}
# ...
}
To learn more about advanced configuration patterns, see Advanced Configuration
Command Options
get
Arguments
Accepts both URLs and paths to lists of URLs; both can be provided at the same time.
# grab get <url|file> [url|file...] [options]
grab get https://example.com/gallery/1 \
https://example.com/gallery/2 \
path/to/list.ini \
other/file.ini -n
Options
| Long | Short | Default | Description |
| --- | --- | --- | --- |
| force | f | false | Overwrites already existing files |
| config | c | nil | Specify the path to a configuration file |
| strict | s | false | Will stop the program at the first encountered error |
| dry-run | n | false | Will send requests without writing to the disk |
| progress | p | false | Show a progress bar |
| quiet | q | false | Suppress all output to stdout (errors will still be printed to stderr). This option takes precedence over verbose |
| verbose | v | 1 | Set the verbosity level: -v is 1, -vv is 2, and so on. quiet overrides this option |
Next steps
- Retries & Timeout
- Network options inheritance
- URL manipulation
- Destination manipulation
- Display progress bar
- Better logging
- Add HCL eval context functions
- Distribute via various package managers:
- Homebrew
- Apt
- Chocolatey
- Scoop
- Scripting language integration
- Plugins?
- Sequential jobs (like GitHub workflows)
License
Distributed under the MIT License.