extract

module

Version: v0.0.0-...-d08fc69 | Published: Sep 6, 2024 | License: Apache-2.0

Repository

github.com/ivanvanderbyl/extract

README

`extract`

A tool for extracting structured data from unstructured documents. It is designed around the Unix philosophy of building simple, modular programs that can be connected together to perform more complex tasks.

Installation

# TBA

Usage

extract supports two modes of operation: extract and infer. Extract mode, invoked as `extract run`, pulls structured data out of an unstructured document according to a JSON schema. Infer mode, invoked as `extract infer`, derives a JSON schema from a document; that schema can then be used to extract structured data from similar documents.

Extract

In extract mode, the tool takes a document and a JSON schema as input and outputs the structured data extracted from the document.

extract run --schema <schema> <document>
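
As an illustration, a schema for a simple invoice might look like the sketch below. The file names `invoice.schema.json` and `invoice.txt` and the field names are hypothetical, and the example assumes that `--schema` accepts a path to a schema file and that the tool understands these JSON Schema keywords; the README does not confirm either.

```shell
# Write a hypothetical JSON schema describing the fields to extract.
cat > invoice.schema.json <<'EOF'
{
  "type": "object",
  "properties": {
    "invoice_number": { "type": "string" },
    "total": { "type": "number" }
  },
  "required": ["invoice_number", "total"]
}
EOF

# Assumption: --schema takes a file path. The call is skipped when the
# extract binary or the (hypothetical) input document is not available.
doc="invoice.txt"
if command -v extract >/dev/null 2>&1 && [ -f "$doc" ]; then
  extract run --schema invoice.schema.json "$doc"
fi
```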

It also supports reading the document from stdin:

cat <document> | extract run --schema <schema>
curl -s https://somewebsite.com | extract run --schema <schema>

The structured data is written to stdout, so it can be redirected to a file or piped into another tool:

extract run --schema <schema> <document> > structured_data.json
extract run --schema <schema> <document> | jq ".data"
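
Because the output goes to stdout, the same pattern extends to several documents with an ordinary shell loop. The helper below is a sketch only: `extract_all` is a hypothetical function name, and it again assumes that `--schema` accepts a file path.

```shell
# extract_all SCHEMA DOC...
# Runs extraction once per document and writes each result next to its
# input, replacing the file extension with .json (report.txt -> report.json).
extract_all() {
  schema="$1"
  shift
  for doc in "$@"; do
    out="${doc%.*}.json"
    # Report failures on stderr but keep processing the remaining documents.
    extract run --schema "$schema" "$doc" > "$out" || echo "failed: $doc" >&2
  done
}
```

For example, `extract_all invoice.schema.json january.txt february.txt` would be expected to produce `january.json` and `february.json`.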

Infer

In infer mode, the tool takes a document as input and outputs a JSON schema that describes the structure of the document.

Note: The infer mode is still experimental and may not work as expected. You may need to manually edit the inferred schema to get the desired results.

extract infer <document>

It also supports reading the document from stdin:

cat <document> | extract infer
curl -s https://somewebsite.com | extract infer
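
The two modes can be chained: infer a schema from one sample document, review it, then reuse it on similar documents. The sketch below assumes that `extract infer` writes the schema to stdout and that `--schema` accepts a file path; `infer_then_extract` and the default file name `inferred.schema.json` are hypothetical.

```shell
# infer_then_extract SAMPLE TARGET [SCHEMA_FILE]
# Infers a schema from SAMPLE, saves it, then extracts data from TARGET.
infer_then_extract() {
  sample="$1"
  target="$2"
  schema="${3:-inferred.schema.json}"

  # Step 1: infer the schema and save it so it can be inspected or edited.
  extract infer "$sample" > "$schema" || return 1

  # Stop early if inference produced nothing usable.
  [ -s "$schema" ] || { echo "empty schema: $schema" >&2; return 1; }

  # Step 2: apply the saved schema to a similar document.
  extract run --schema "$schema" "$target"
}
```

Since infer mode is experimental, it may be safer to run the two steps separately and edit the saved schema by hand before the extraction step.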

Directories

Path	Synopsis
cmd/commands/extract
pkg/content
pkg/infer
