
jsonextract is a Go library for extracting JSON and JavaScript objects from any source. It can be used for data extraction tasks like web scraping. Any text that looks like a JavaScript object or is close to valid JSON will be converted to valid JSON.
## Examples
Here is an example program that extracts all JSON objects from a file and prints them to the console:
```go
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/xarantolus/jsonextract"
)

func main() {
	file, err := os.Open("file.html")
	if err != nil {
		log.Fatalln(err.Error())
	}
	defer file.Close()

	// Print all JSON objects and arrays found in file.html
	err = jsonextract.Reader(file, func(b []byte) error {
		// Here you can parse the JSON data using a normal parser, e.g. from "encoding/json".
		// If you want to continue with the next object, return nil.
		// To stop after this object, you can return jsonextract.ErrStop.
		// len(b) > 0 will always be true.
		// But here, we just print the data:
		fmt.Println(string(b))

		return nil
	})
	if err != nil {
		log.Fatalln(err.Error())
	}
}
```
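If you need structured data rather than raw bytes, you can decode each candidate inside the callback and stop early once you have what you need. Here is a minimal sketch; the `"wanted"` key is just an illustrative assumption, not something the library defines:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"

	"github.com/xarantolus/jsonextract"
)

func main() {
	file, err := os.Open("file.html")
	if err != nil {
		log.Fatalln(err.Error())
	}
	defer file.Close()

	err = jsonextract.Reader(file, func(b []byte) error {
		var data map[string]interface{}
		if json.Unmarshal(b, &data) != nil {
			// Not a JSON object we can decode, continue with the next candidate
			return nil
		}
		if _, ok := data["wanted"]; ok {
			// "wanted" is a made-up key name for this sketch
			fmt.Printf("found it: %v\n", data)
			// Stop the extraction early
			return jsonextract.ErrStop
		}
		return nil
	})
	// Guard against ErrStop in case it is passed through to the caller
	if err != nil && err != jsonextract.ErrStop {
		log.Fatalln(err.Error())
	}
}
```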
There's a small extractor program that uses this library to get data from URLs and files.
If you want to give it a try, you can just `go get` it:

```
go get -u github.com/xarantolus/jsonextract/cmd/jsonx
```
You can use it on files like this:

```
jsonx reader_test.go
```

or on URLs like this:

```
jsonx "https://stackoverflow.com/users/5728357/xarantolus?tab=topactivity"
```
### Other examples
There are also examples in the `examples` subdirectory.

The string example shows how to use the package to quickly get all JSON objects/arrays in a string; it uses a `strings.Reader` for that.
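A minimal sketch of that approach: wrap the string in a `strings.Reader` and pass it to `jsonextract.Reader` (the input text here is made up):

```go
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/xarantolus/jsonextract"
)

func main() {
	const input = `Some text {"key": "value"} and an array [1, 2, 3] in between`

	// strings.NewReader turns the string into an io.Reader the package accepts
	err := jsonextract.Reader(strings.NewReader(input), func(b []byte) error {
		fmt.Println(string(b))
		return nil
	})
	if err != nil {
		log.Fatalln(err.Error())
	}
}
```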
The stackoverflow-chart example shows how to extract the reputation chart data of a StackOverflow user. The extracted data is then used to draw the same chart using Go.
## Supported notations
This package supports extracting not just normal JSON, but also other JavaScript object notation. This means that text like the following, which is definitely not valid JSON, can also be extracted and converted to an object:
```js
{
	// Keys without quotes are valid in JavaScript, but not in JSON
	key: "value",
	num: 295.2,

	// Comments are removed while processing

	// Mixing normal and quoted keys is possible
	"obj": {
		"quoted": 325,
		'other quotes': true,
		unquoted: 'test', // This trailing comma will be removed
	},

	// JSON doesn't support all these number formats
	"dec": 21,
	"hex": 0x15,
	"oct": 0o25,
	"bin": 0b10101,
	bigint: 21n,

	// Undefined will be interpreted as null
	"udef": undefined,

	`lastvalue`: `multiline strings are
no problem`
}
```
results in
{"key":"value","num":295.2,"obj":{"quoted":325,"other quotes":true,"unquoted":"test"},"dec":21,"hex":21,"oct":21,"bin":21,"bigint":21,"udef":null,"lastvalue":"multiline strings are\nno problem"
## Notes
- While the functions take an `io.Reader` and stream data from it without buffering everything in memory, the underlying JS lexer uses `ioutil.ReadAll`. That means this doesn't work well on files that are larger than memory.
- When extracting objects from JavaScript files, you can end up with many small arrays like `[0]`, `[1]` or `["i"]`, which result from indices being used in the script. You have to filter these out yourself, as shown in the sketch after this list.
- While this package supports most number formats, there are some that don't work because the lexer doesn't support them. One example is underscores in numbers: in JS, `2175` can be written as `2_175` or `0x8_7_f`, but that doesn't work here. Another example is numbers with a leading zero; they are rejected by the lexer because it's not clear whether they should be interpreted as octal or decimal.
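A minimal sketch of such a filter, skipping anything that decodes to an array with fewer than two elements (the file name and the threshold are arbitrary assumptions for this sketch):

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"

	"github.com/xarantolus/jsonextract"
)

func main() {
	file, err := os.Open("script.js")
	if err != nil {
		log.Fatalln(err.Error())
	}
	defer file.Close()

	err = jsonextract.Reader(file, func(b []byte) error {
		// Skip index-like arrays such as [0], [1] or ["i"]
		var arr []interface{}
		if json.Unmarshal(b, &arr) == nil && len(arr) < 2 {
			return nil // continue with the next candidate
		}
		fmt.Println(string(b))
		return nil
	})
	if err != nil {
		log.Fatalln(err.Error())
	}
}
```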
## Changelog
- v1.3.1: Support more number formats by transforming them to decimal numbers, which are valid in JSON
- v1.3.0: Return to non-streaming version that worked with all objects, the streaming version seemed to skip certain parts and thus wasn't very great
- v1.2.0: Fork the JS lexer and make it use the underlying streaming lexer that was already in that package. That's a bit faster and prevents many unnecessary resets. This also makes it possible to extract from very large files with a small memory footprint.
- v1.1.11: No longer stop the lexer from reading too much, as that didn't work that well
- v1.1.10: Stop the JS lexer from reading all data from input at once, which prevents expensive resets
- v1.1.9: JS Regex patterns are now returned as strings
- v1.1.8: Fix bug where template literals were interpreted the wrong way when certain escape sequences were present
- v1.1.7: More efficient extraction when a trailing comma is found
- v1.1.6: Always return the correct error
- v1.1.5: Small clarification on the callback
- v1.1.4: Support trailing commas in arrays and objects
- v1.1.3: Many small internal changes
- v1.1.2: Also support JS template strings
- v1.1.1: Also turn single-quoted strings into valid JSON
- v1.1.0: Now supports anything that looks like JSON, which also includes JavaScript object declarations
- v1.0.0: Initial version, supports only JSON
## Thanks
Thanks to everyone who made the `parse` package possible. Without it, creating this extractor would have been a lot harder.
This is free as in freedom software. Do whatever you like with it.