span-crossref-snapshot

command
v0.1.345
Published: Oct 21, 2022 License: GPL-3.0 Imports: 18 Imported by: 0

Documentation

Overview

Given a single file with Crossref works API messages, create a potentially smaller file that contains only the most recent version of each document.

Works in a three-stage, two-pass fashion: (1) extract, (2) identify, (3) extract. Performance data point (30M compressed records, 11m33.871s total):

2017/07/24 18:26:10 stage 1: 8m13.799431646s
2017/07/24 18:26:55 stage 2: 45.746997314s
2017/07/24 18:29:30 stage 3: 2m34.23537293s

$ span-crossref-snapshot -z crossref.ndj.gz -o out.ndj.gz
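
The three stages described above can be illustrated with a minimal in-memory sketch. This is not the tool's actual implementation, which works on (compressed) files. The names work, entry and Snapshot are hypothetical, and using the DOI as document key and indexed.timestamp as version marker is an assumption made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// work holds the two fields this sketch needs from a Crossref works message.
// Assumption: indexed.timestamp decides which version is the most recent.
type work struct {
	DOI     string `json:"DOI"`
	Indexed struct {
		Timestamp int64 `json:"timestamp"`
	} `json:"indexed"`
}

// entry is the compact per-line record produced by stage 1 (hypothetical).
type entry struct {
	line      int
	doi       string
	timestamp int64
}

// Snapshot keeps, for each DOI, only the line with the highest timestamp.
// Input order is preserved for the lines that survive.
func Snapshot(lines []string) ([]string, error) {
	// Stage 1 (first pass): extract line number, DOI and timestamp.
	entries := make([]entry, 0, len(lines))
	for i, line := range lines {
		var w work
		if err := json.Unmarshal([]byte(line), &w); err != nil {
			return nil, fmt.Errorf("line %d: %w", i+1, err)
		}
		entries = append(entries, entry{line: i, doi: w.DOI, timestamp: w.Indexed.Timestamp})
	}

	// Stage 2: identify the winning line per DOI; later lines win ties.
	latest := make(map[string]entry)
	for _, e := range entries {
		if cur, ok := latest[e.doi]; !ok || e.timestamp >= cur.timestamp {
			latest[e.doi] = e
		}
	}
	keep := make(map[int]bool, len(latest))
	for _, e := range latest {
		keep[e.line] = true
	}

	// Stage 3 (second pass): emit only the lines marked in stage 2.
	var out []string
	for i, line := range lines {
		if keep[i] {
			out = append(out, line)
		}
	}
	return out, nil
}
```

Calling Snapshot with one JSON message per slice element returns the surviving messages. The input is traversed twice (stages 1 and 3), while stage 2 only touches the small extracted records, which mirrors the two-pass design.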

TODO: externalize decompression, which seems to slow things down; only about 10K docs/s when running in parallel.
