arachne

command module

v0.0.0-...-2ab5e6a Latest Latest Go to latest Published: Jun 1, 2023 License: MIT Imports: 9 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/insprac/arachne

Links

Open Source Insights

README ¶

Arachne

Arachne is a web crawler implemented in Go, capable of crawling multiple URLs and saving the contents as Markdown files. This tool is perfect for those looking to build a local repository of web pages for offline reading or for performing subsequent data analysis.

Features

Scrape any webpage and save the content as Markdown files.
It allows you to specify starting URLs and the allowed URL prefixes for crawling.
Utilizes concurrency for faster scraping.
Automatically handles the conversion of HTML content to Markdown.
Respects and parses relative and absolute URLs.
Removes unnecessary elements from the webpage like scripts, styles, headers, footers, navigation, and sidebars to focus on the main content.

Usage

After cloning this repository and navigating into the project directory, you can run the program using:

go run main.go \
    --start "https://www.example.com,https://www.another-example.com" \
    --allow-prefix "https://www.example.com,https://www.another-example.com/some/path" \
    --out "/path/to/output"

The program accepts the following command-line arguments:

--start: A comma-separated list of URLs from which to start crawling.
--allow-prefix: A comma-separated list of URL prefixes which the crawler is allowed to visit.
--out: The directory where the Markdown files will be saved.

Note

Please use this tool responsibly, respecting the robots.txt files and not overloading servers. Be mindful of your network usage as well, as web scraping can consume a lot of bandwidth depending on the size of the websites you are scraping.

Contribution

This is an open-source project. Feel free to fork and make any contributions you feel will enhance the functionality of this web scraper.

License

This project is licensed under the MIT License - see the LICENSE file for details.

arachne

Documentation ¶

There is no documentation for this package.

Source Files ¶

View all Source files

main.go

Directories ¶

Path	Synopsis
crawler

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL