jsonextract

Version: v1.1.11 (this package is not in the latest version of its module)
Published: Feb 20, 2021 License: MIT Imports: 9 Imported by: 3

README

jsonextract

jsonextract is a Go library for extracting JSON and JavaScript objects from any source. It can be used for data extraction tasks like web scraping.

Any text that looks like a JavaScript object, or that is close to valid JSON, is converted to valid JSON.

Examples

Here is an example program that extracts all JSON objects from a file and prints them to the console:

package main

import (
	"fmt"
	"log"
	"os"

	"github.com/xarantolus/jsonextract"
)

func main() {
	file, err := os.Open("file.html")
	if err != nil {
		log.Fatalln(err.Error())
	}
	defer file.Close()

	// Print all JSON objects and arrays found in file.html
	err = jsonextract.Reader(file, func(b []byte) error {
		// Here you can parse the JSON data using a normal parser, e.g. from "encoding/json"
		// If you want to continue with the next object, return nil
		// To stop after this object, you can return jsonextract.ErrStop

		// len(b) > 0 will always be true

		// But here, we just print the data
		fmt.Println(string(b))

		return nil
	})
	if err != nil {
		log.Fatalln(err.Error())
	}
}
Extractor program

There's a small extractor program that uses this library to get data from URLs and files.

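If you want to give it a try, you can install it with go get (on Go 1.17 and later, installing commands is done with go install and a version suffix such as @latest instead):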

go get -u github.com/xarantolus/jsonextract/cmd/jsonx

You can use it on files, e.g. like this:

jsonx reader_test.go

or on URLs like this:

jsonx "https://stackoverflow.com/users/5728357/xarantolus?tab=topactivity"
Other examples

There are also examples in the examples subdirectory.

The string example shows how to use the package to quickly get all JSON objects and arrays in a string. It uses a strings.Reader for that.

The stackoverflow-chart example shows how to extract the reputation chart data of a StackOverflow user. Extracted data is then used to draw the same chart using Go:

[Image: comparison of the chart from StackOverflow with the scraped and redrawn result]

Supported notations

This software supports not just extracting normal JSON, but also other JavaScript notation.

This means that text like the following, which is definitely not valid JSON, can also be extracted to an object:

{
	// Keys without quotes are valid in JavaScript, but not in JSON
	key: "value",
	num: 295.2,

	// Comments are removed while processing

	// Mixing normal and quoted keys is possible 
	"obj": {
		"quoted": 325,
		'other quotes': true,
		unquoted: 'test', // This trailing comma will be removed
	},

	`lastvalue`: `multiline strings are
no problem`
}

results in

{"key":"value","num":295.2,"obj":{"quoted":325,"other quotes":true,"unquoted":"test"},"lastvalue":"multiline strings are\nno problem"}
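Because the result is plain JSON, it can be handed to any standard decoder. The following is a minimal sketch that decodes the output shown above with "encoding/json"; the Config type, its field names and the decodeConfig helper are hypothetical and exist only for this illustration. The input literal is the normalized output copied from above, so the sketch runs without jsonextract itself.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// Config mirrors part of the normalized example output.
// The type and its field names are hypothetical, chosen for illustration.
// Keys that have no matching field (here "other quotes") are ignored by encoding/json.
type Config struct {
	Key string  `json:"key"`
	Num float64 `json:"num"`
	Obj struct {
		Quoted   int    `json:"quoted"`
		Unquoted string `json:"unquoted"`
	} `json:"obj"`
	LastValue string `json:"lastvalue"`
}

// decodeConfig parses one normalized object, e.g. the bytes received in a callback.
func decodeConfig(b []byte) (Config, error) {
	var c Config
	err := json.Unmarshal(b, &c)
	return c, err
}

func main() {
	// The normalized output from the example above.
	normalized := []byte(`{"key":"value","num":295.2,"obj":{"quoted":325,"other quotes":true,"unquoted":"test"},"lastvalue":"multiline strings are\nno problem"}`)

	c, err := decodeConfig(normalized)
	if err != nil {
		log.Fatalln(err)
	}
	fmt.Printf("%s %.1f %d %q\n", c.Key, c.Num, c.Obj.Quoted, c.LastValue)
}
```

In a real program, the same decoding step would typically sit inside the callback passed to jsonextract.Reader.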
Notes
  • After passing an io.Reader to a function of this package, you should not use it again. It might have been read to the end, but if processing stopped early (using ErrStop), some data might remain in it.
  • When extracting objects from JavaScript files, you can end up with many arrays that look like [0], [1], ["i"], which is a result of indices being used in the script. You have to filter these out yourself.
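One possible way to filter out such index arrays is sketched below. The isIndexLike helper is hypothetical and not part of jsonextract; it assumes that an array holding exactly one scalar value is uninteresting, which may be too aggressive for some inputs.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// isIndexLike reports whether b is a JSON array that holds exactly one scalar
// (number, string, boolean or null), such as [0], [1] or ["i"].
// This heuristic is an assumption for illustration, not part of jsonextract.
func isIndexLike(b []byte) bool {
	var elems []json.RawMessage
	if err := json.Unmarshal(b, &elems); err != nil {
		return false // not an array (e.g. an object), so it is kept
	}
	if len(elems) != 1 {
		return false
	}
	first := bytes.TrimSpace(elems[0])
	// A single nested object or array is not treated as an index.
	return len(first) > 0 && first[0] != '{' && first[0] != '['
}

func main() {
	samples := []string{`[0]`, `["i"]`, `{"key":"value"}`, `[1,2,3]`}
	for _, s := range samples {
		if isIndexLike([]byte(s)) {
			continue // skip arrays that look like script indices
		}
		fmt.Println(s)
	}
}
```

Inside a callback passed to jsonextract.Reader, the same check could be used to return nil early for index-like arrays.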
Changelog
  • v1.1.11: No longer stop the lexer from reading too much, as that didn't work that well
  • v1.1.10: Stops the JS lexer from reading all data from input at once, prevents expensive resets
  • v1.1.9: JS Regex patterns are now returned as strings
  • v1.1.8: Fix bug where template literals were interpreted the wrong way when certain escape sequences were present
  • v1.1.7: More efficient extraction when a trailing comma is found
  • v1.1.6: Always return the correct error
  • v1.1.5: Small clarification on the callback
  • v1.1.4: Support trailing commas in arrays and objects
  • v1.1.3: Many small internal changes
  • v1.1.2: Also support JS template strings
  • v1.1.1: Also turn single-quoted strings into valid JSON
  • v1.1.0: Now supports anything that looks like JSON, which also includes JavaScript object declarations
  • v1.0.0: Initial version, supports only JSON
License

This is free as in freedom software. Do whatever you like with it.

Documentation

Constants

This section is empty.

Variables

var (
	// ErrStop can be returned from a JSONCallback function to indicate that processing should stop at this object
	ErrStop = errors.New("stop processing json")
)

Functions

func Reader

func Reader(reader io.Reader, callback JSONCallback) (err error)

Reader reads all JSON and JavaScript objects from the input and calls callback for each of them. If callback returns an error, Reader stops processing and returns that error, unless the error is ErrStop, in which case Reader returns nil. The reader must yield UTF-8 bytes; if you are not sure that it does, wrap it with the charset.NewReader function to convert the input to UTF-8 (https://pkg.go.dev/golang.org/x/net/html/charset#NewReader).

func ReaderObjects

func ReaderObjects(reader io.Reader) (objects []json.RawMessage, err error)

ReaderObjects reads from the given io.Reader and returns all JSON and JavaScript objects it can find.

Types

type JSONCallback

type JSONCallback func([]byte) error

JSONCallback is the callback function passed to Reader. Found JSON objects will be passed to it as bytes. If this function returns an error, processing will stop and return that error. If the returned error is ErrStop, processing will stop but not return an error.
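The stop and error semantics described above can be sketched without the library. In the following example, errStop, errInvalid, feed and firstN are all local, hypothetical stand-ins: errStop plays the role of ErrStop, and feed mimics only the documented error handling of Reader, not its extraction logic or its actual implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins so the sketch runs without the library.
// errStop plays the role of jsonextract.ErrStop; errInvalid is an arbitrary other error.
var (
	errStop    = errors.New("stop processing json")
	errInvalid = errors.New("unexpected object")
)

// feed mimics the documented contract only: it calls callback for each object,
// stops at the first error, and turns errStop into a nil result.
// It is not the library's implementation.
func feed(objects [][]byte, callback func([]byte) error) error {
	for _, b := range objects {
		if err := callback(b); err != nil {
			if errors.Is(err, errStop) {
				return nil
			}
			return err
		}
	}
	return nil
}

// firstN returns a callback that appends each object to out
// and asks for processing to stop once n objects were collected.
func firstN(n int, out *[]string) func([]byte) error {
	return func(b []byte) error {
		*out = append(*out, string(b))
		if len(*out) >= n {
			return errStop
		}
		return nil
	}
}

func main() {
	input := [][]byte{[]byte(`{"a":1}`), []byte(`[2]`), []byte(`{"c":3}`)}

	var got []string
	err := feed(input, firstN(2, &got))
	fmt.Println(got, err) // stopping early is not reported as an error

	err = feed(input, func([]byte) error { return errInvalid })
	fmt.Println(err) // any other error is passed through
}
```

With the real package, a callback like the one returned by firstN would be passed to Reader and would return ErrStop instead of the local errStop.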

Directories

Path              Synopsis
cmd/jsonx         command
examples/readme   command
examples/string   command
